DESIGN OF POLYKETIDE SYNTHASE GENES 

FIELD OF INVENTION 

The present invention provides methods for the analysis of polyketides and the design of 
polyketide synthase genes. The invention relates to the fields of computational analysis, 
chemistry, molecular biology, and medicine. 

BACKGROUND OF THE INVENTION 

The class of compounds known as polyketides is a large family of diverse compounds 
synthesized primarily from 2-carbon unit building block compounds through a series of 
condensations and subsequent modifications. Polyketides occur in many types of organisms, 
including fungi and mycelial bacteria such as the actinomycetes. There is a wide variety of 
polyketide structures, and the class of polyketides encompasses numerous compounds with 
diverse activities. Epothilone, erythromycin, FK-506, FK-520, megalomicin, narbomycin, 
oleandomycin, picromycin, rapamycin, spinosyn, and tylosin are examples of such compounds. 

Given the difficulty in producing polyketide compounds by traditional chemical 
methodology, and the typically low production of polyketides in wild type cells, there has been 
considerable interest in finding improved or alternate means to produce polyketide compounds. 
See PCT Publication Nos. WO 95/08548; WO 96/40968; WO 97/02358; and WO 98/27203; United 
States Patent Nos. 5,962,290; 5,672,491; and 5,712,146; Fu et al., Biochemistry 33: 9321-9326 
(1994); McDaniel et al., Science 262: 1546-1555 (1993); and Rohr, Angew. Chem. Int. Ed. Engl. 
34(8): 881-888 (1995), each of which is incorporated herein by reference. 

Polyketides are synthesized in nature by polyketide synthase (PKS) enzymes. These 
enzymes, which are complexes of multiple large proteins, are similar to the synthases that 
catalyze condensation of 2-carbon unit building block compounds in the biosynthesis of fatty 
acids. The genes that encode PKS enzymes usually consist of three or more open reading 
frames (ORFs). Two major types of PKS enzymes are known that differ in their composition 
and mode of synthesis. These two major types of PKS enzymes are commonly referred to as 
Type I or "modular" and Type II or "iterative" PKS enzymes. 

Modular PKSs produce many different polyketides, including a large number of 12-, 14-, 
and 16-membered macrolide antibiotics including erythromycin, megalomicin, methymycin, 
narbomycin, oleandomycin, picromycin, and tylosin. Each ORF of a modular PKS can 
comprise one, two, or more "modules" of ketosynthase activity, each module of which consists 
of at least two (if a loading module) and more typically three (for the simplest extender module) 
or more enzymatic activities or "domains." These large multifunctional enzymes (>300,000 
kDa) catalyze the biosynthesis of polyketide macrolactones through multistep pathways 
involving decarboxylative condensations between acyl thioesters followed by cycles of varying 
β-carbon processing activities (see O'Hagan, D., The polyketide metabolites, E. Horwood, New 
York, 1991, which is incorporated herein by reference). 

During the past half decade, the study of modular PKS function and specificity has been 
greatly facilitated by the plasmid-based Streptomyces coelicolor expression system developed 
with the 6-deoxyerythronolide B (6-dEB) synthase (DEBS) genes (see Kao et al., Science 265: 
509-512 (1994), McDaniel et al., Science 262: 1546-1557 (1993), and U.S. Patent Nos. 
5,672,491 and 5,712,146, each of which is incorporated herein by reference). The advantages of 
this plasmid-based genetic system for DEBS are that it overcomes the tedious and limited 
techniques for manipulating the natural DEBS host organism, Saccharopolyspora erythraea, 
allows more facile construction of recombinant PKSs, and reduces the complexity of PKS 
analysis by providing a "clean" host background. This system also expedited construction of a 
combinatorial modular polyketide library in Streptomyces (see PCT publication No. WO 
98/49315, incorporated herein by reference). 

The ability to control aspects of polyketide biosynthesis, such as monomer selection and 
degree of β-carbon processing, by genetic manipulation of PKSs has stimulated great interest in 
the combinatorial engineering of novel antibiotics (see Hutchinson, Curr. Opin. Microbiol. 1: 
319-329 (1993); Carreras and Santi, Curr. Opin. Biotech. 9: 403-411 (1998); and U.S. Patent 
Nos. 5,962,290; 5,712,146; and 5,672,491, each of which is incorporated herein by reference). 
This interest has resulted in the cloning, analysis, and manipulation by recombinant DNA 
technology of genes that encode PKS enzymes. The resulting technology allows one to 
manipulate a known PKS gene cluster either to produce the polyketide synthesized by that PKS 
at higher levels than occur in nature or in hosts that otherwise do not produce the polyketide. 
The technology also allows one to produce molecules that are structurally related to, but distinct 
from, the polyketides produced from known PKS gene clusters. 

Polyketides are assembled by polyketide synthases through successive condensations of 
activated coenzyme-A thioester monomers derived from small organic acids such as acetate, 
propionate, and butyrate. Active sites required for condensation include an acyltransferase 
(AT), acyl carrier protein (ACP), and beta-ketoacylsynthase (KS). Each condensation cycle 
results in a β-keto group that undergoes all, some, or none of a series of processing activities. 
Active sites that perform these reactions include a ketoreductase (KR), dehydratase (DH), and 
enoylreductase (ER). Thus, the absence of any beta-keto processing domain results in the 
presence of a ketone, a KR alone gives rise to a hydroxyl, a KR and DH result in an alkene, 
while a KR, DH, and ER combination leads to complete reduction to an alkane. After assembly 
of the polyketide chain, the molecule typically undergoes cyclization(s) and post-PKS 
modification (e.g., glycosylation, oxidation, acylation) to achieve the final active compound. 
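
The relationship between the reductive domains present in a module and the functional group left at the beta-carbon can be sketched in a few lines of Python. This is a minimal illustration, not part of the described invention: the function name `beta_carbon_outcome` is hypothetical, and the treatment of a DH or ER that appears without the preceding KR (reported here as leaving the ketone untouched) is an assumption made for the sketch.

```python
def beta_carbon_outcome(domains):
    """Return the functional group left at the beta-carbon for a set of
    reductive domains, following the KR -> DH -> ER order in the text."""
    present = set(domains)
    unknown = present - {"KR", "DH", "ER"}
    if unknown:
        raise ValueError(f"unrecognized domains: {sorted(unknown)}")
    # Assumption of this sketch: each step needs the product of the one
    # before it, so DH or ER without KR leaves the ketone in place.
    if "KR" not in present:
        return "ketone"      # no beta-keto processing
    if "DH" not in present:
        return "hydroxyl"    # KR only
    if "ER" not in present:
        return "alkene"      # KR + DH
    return "alkane"          # KR + DH + ER


for combo in ([], ["KR"], ["KR", "DH"], ["KR", "DH", "ER"]):
    print(combo, "->", beta_carbon_outcome(combo))
```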

To illustrate the synthesis of a macrolide by a modular PKS (see Cane et al., Science 
282: 63 (1998), incorporated herein by reference), one can refer to the PKS that produces the 
erythromycin polyketide (6-deoxyerythronolide B synthase or DEBS; see U.S. Patent No. 
5,824,513, incorporated herein by reference). In the modular DEBS PKS enzyme, the enzymatic 
steps for each round of condensation and reduction are encoded within a single "module" of the 
polypeptide (i.e., one distinct module for every condensation cycle). As shown in Figure 1, 
DEBS consists of a loading module, six extender modules, and a chain-terminating thioesterase 
(TE) domain within three extremely large polypeptides encoded by three open reading frames 
(ORFs, designated eryAI, eryAII, and eryAIII). 

Each of the three polypeptide subunits of DEBS (DEBS1, DEBS2, and DEBS3 in 
Figure 1) contains 2 extender modules. DEBS1 additionally contains the loading module, and 
DEBS3 contains the TE domain. Collectively, these proteins catalyze the condensation and 
appropriate reduction of one propionyl CoA starter unit and six methylmalonyl CoA extender 
units. Modules 1, 2, 5, and 6 contain KR domains; module 4 contains a complete set, 
KR/DH/ER, of reductive and dehydratase domains; and module 3 contains no functional 
reductive domain. Following the condensation and appropriate dehydration and reduction 
reactions, the enzyme-bound intermediate is lactonized by the TE at the end of extender module 
6 to form 6-dEB (compound 1 in Figure 1). 
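
The DEBS organization described above lends itself to a small tabular model. The following Python sketch records, for each extender module, the polypeptide that carries it and the reductive domains named in the text, then reports the resulting beta-carbon state. The variable and function names (`DEBS`, `OUTCOME`, `summarize`, `modules_per_protein`) are hypothetical, and the KS, AT, ACP, loading, and TE domains are left out for brevity.

```python
# (polypeptide, extender module number, reductive domains present),
# transcribed from the description of DEBS in the text.
DEBS = [
    ("DEBS1", 1, {"KR"}),
    ("DEBS1", 2, {"KR"}),
    ("DEBS2", 3, set()),                # no functional reductive domain
    ("DEBS2", 4, {"KR", "DH", "ER"}),   # complete reductive set
    ("DEBS3", 5, {"KR"}),
    ("DEBS3", 6, {"KR"}),
]

# Beta-carbon state produced by each domain combination.
OUTCOME = {
    frozenset(): "ketone",
    frozenset({"KR"}): "hydroxyl",
    frozenset({"KR", "DH"}): "alkene",
    frozenset({"KR", "DH", "ER"}): "alkane",
}


def summarize(pks):
    """List (polypeptide, module, beta-carbon state) for each module."""
    return [(protein, module, OUTCOME[frozenset(domains)])
            for protein, module, domains in pks]


def modules_per_protein(pks):
    """Count extender modules carried by each polypeptide."""
    counts = {}
    for protein, _, _ in pks:
        counts[protein] = counts.get(protein, 0) + 1
    return counts


for row in summarize(DEBS):
    print(row)
print(modules_per_protein(DEBS))
```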



More particularly, the loading module of DEBS consists of two domains, an 
acyltransferase (AT) domain and an acyl carrier protein (ACP) domain. In other PKS enzymes, the 
loading module is not composed of an AT and an ACP but instead utilizes a partially inactivated 
KS, an AT, and an ACP. This partially inactivated KS is in most instances called KS^Q, where 
the superscript letter is the abbreviation for the amino acid, glutamine, that is present instead of a 
cysteine in the active site that is believed to be required for condensation activity. Although the 
KS^Q domain lacks condensation activity, it retains decarboxylase activity. The AT domain of 
the loading module recognizes a particular acyl-CoA (propionyl for DEBS, which can also 
accept acetyl) and transfers it as a thiol ester to the ACP of the loading module. Concurrently, 
the AT on each of the extender modules recognizes a particular extender-CoA (methylmalonyl 
for DEBS) and transfers it to the ACP of that module to form a thioester. Once the PKS is 
primed with acyl- and malonyl-ACPs, the acyl group of the loading module migrates to form a 
thiol ester (trans-esterification) at the KS of the first extender module; at this stage, extender 
module 1 possesses an acyl-KS and a methylmalonyl-ACP. The acyl group derived from the 
loading module is then covalently attached to the alpha-carbon of the malonyl group to form a 
carbon-carbon bond, driven by concomitant decarboxylation, and generating a new acyl-ACP 
that has a backbone two carbons longer than the loading unit (elongation or extension). The 
growing polyketide chain (various intermediates are shown in Figure 1) is transferred from the 
ACP to the KS of the next module, and the process continues. 

The polyketide chain, growing by two carbons each module, is sequentially passed as a 
covalently bound thiol ester from module to module, in an assembly line-like process. The 
carbon chain produced by this process alone would possess a ketone at every other carbon atom, 
producing a polyketone, from which the name polyketide arises. Commonly, however, 
additional enzymatic activities modify the beta keto group of the polyketide chain to which the 
two carbon unit has been added before the chain is transferred to the next module. Modules may 
contain additional enzymatic activities as well, such as methyl transferase domains, but there are 
no such additional activities in DEBS. 

Once a polyketide chain traverses the final extender module of a modular PKS, it 
encounters the releasing domain or thioesterase found at the carboxyl end of most PKSs. Here, 
the polyketide is cleaved from the enzyme and cyclized. The resulting polyketide can be 
modified further by tailoring or modification enzymes; these enzymes add carbohydrate groups 
or methyl groups, or make other modifications, e.g., oxidation or reduction, on the polyketide 
core molecule. For example, the final steps in conversion of 6-dEB to erythromycin A include 
the actions of a number of modification enzymes that carry out C-6 hydroxylation, attachment of 
mycarose and desosamine sugars, C-12 hydroxylation (which produces erythromycin C), and 
conversion of mycarose to cladinose via O-methylation. These modifications in various 
combinations result in erythromycins A (compound 2 in Figure 1), B, C, and D. 

While the detailed understanding of the mechanisms by which PKS enzymes function 
and the development of methods for manipulating PKS genes have facilitated the creation of 
novel polyketides, there remain substantial impediments to the creation of novel polyketides by 
genetic engineering. One such impediment is the availability of PKS genes. Many polyketides 
are known but only a relatively small portion of the corresponding PKS genes have been cloned 
and are available for manipulation. Moreover, in many instances the producing organism for an 
interesting polyketide is obtainable only with great difficulty and expense, and techniques for its 
growth in the laboratory and production of the polyketide it produces are unknown or difficult or 
time-consuming to practice. Also, even if the PKS genes for a desired polyketide have been 
cloned, those genes may not serve to drive the level of production desired in a particular host 
cell. 

If there were a method to produce a desired polyketide without having to access the 
genes that encode the PKS that produces the polyketide, then many of these difficulties could be 
ameliorated or avoided altogether. The present invention meets this need. 



SUMMARY OF THE INVENTION 

In one embodiment, the present invention provides methods for the computational 
analysis of polyketides and the computer-assisted design of PKS genes. 

In a first aspect, the present invention provides a method for representing the structure of 
a polyketide and/or a PKS gene that encodes the PKS that produces the polyketide by 
alphanumeric symbols that facilitate computer-assisted analysis. 

In a second aspect, the present invention provides a database of polyketides and 
corresponding PKS genes that can be rapidly searched and information extracted for a variety of 
applications. More particularly, this database can include, in one mode, all known polyketides; 
and in another mode, the polyketides, optionally including all intermediates, produced by all 
known PKS genes or a subset thereof. 

In a third aspect, the present invention provides a method for predicting the structure of a 
PKS and its corresponding genes from the structure of a polyketide. 

In a fourth aspect, the present invention provides a method for designing novel PKS 
genes capable of producing a desired polyketide. This aspect of the invention is directed to the 
design and specification of PKS genes via the recombining of modules or portions of modules or 
sets of modules from already known and available PKS genes. In one mode, all possible PKS 
genes encoding a desired polyketide from a set of genes in a database are generated. In another 
mode, only a subset of such possible PKS genes is generated based on one or more parameters 
selected by the user. More particularly, a rating system is provided to sort the PKS genes 
designed for a particular target polyketide based on any one or more of several criteria, including 
number of non-native module interfaces, number of non-native protein interfaces, and other 
parameters as more particularly described below or selected by the user. 

In another embodiment, the present invention provides methods and reagents for 
preparing novel PKS genes that encode PKS enzymes that produce a desired polyketide. 

In a first aspect, the present invention provides a library of recombinant DNA 
compounds, wherein each member of said library encodes a module of a PKS or portions of 
modules or sets of modules having a desired specificity, and the library as a whole encompasses 
all of the members of a desired class of specificities. 

In a second aspect, the present invention provides a method for assembling a PKS gene 
cluster that encodes a PKS that produces a desired polyketide from known and available PKS 
genes other than the naturally occurring PKS genes that produce the polyketide in nature. 

These and other embodiments, modes, and aspects of the invention are described in more 
detail in the following description, the examples, and claims set forth below. 

BRIEF DESCRIPTION OF THE FIGURES 

Figure 1 shows a schematic representation of the PKS enzyme that synthesizes 6- 
deoxyerythronolide B (6-dEB, compound 1). The PKS is composed of three proteins, DEBS1, 
DEBS2, and DEBS3, each of which is represented by an arrow and contains two or more 
modules. Each module is represented by a solid line, and the domains in each module are shown 
inside the arrow. Various intermediates produced during the synthesis are also shown, as are the 
structures of erythromycins A (compound 2), B, and D resulting from modification of 6-dEB. 

Figure 2 shows an illustrative set of 2-carbon unit monomers present in macrocyclic 
polyketides; these monomers can be used to represent polyketide backbone diversity generated 
by commonly used starter and extender units (malonyl CoA and methylmalonyl CoA) and the 
condensation and reduction reactions mediated by PKS enzymes. 

Figure 3 shows a representation of 6-dEB by molecular graph, CHUCKLES notation, 
and SMILES notation. The CHUCKLES notation uses the 2-carbon unit monomers shown in 
Figure 2. In the CHUCKLES notation, the order of attachment of monomers is designated by 
the order in which monomers are listed, and the attachment points within the monomers are 
specified in their definitions. In the SMILES notation, adjacent monomers are attached via 
single (covalent) bonds depicted by dashes. The cyclization bond is represented by the index 1 
adjacent to the Start and Close monomers. 

Figure 4 is a flowchart and block flow diagram in five parts designated A-E, inclusive. 

Flowchart Figure 4A is a block flow diagram of a computer system to design a novel 
PKS (and corresponding genes). 

Flowchart Figure 4B is a block flow diagram wherein the "Computer Program" block (2) 
of Flowchart Figure 4A is further defined. 

Flowchart Figure 4C is a block flow diagram wherein the "Design novel hybrid PKS 
genes from library for TARGET" block of Flowchart Figure 4B is further defined. 

Flowchart Figure 4D is a block flow diagram wherein the "align TARGET with 
STARTER; copy to ALIGNMENT" block of Flowchart Figure 4C is further defined. 

Flowchart Figure 4E is a block flow diagram wherein the "Rate novel hybrid designs" 
block (3) of Flowchart Figure 4B is further defined. 

Figure 5 shows a flowchart of a matching method for the generation of the CHUCKLES 
strings used for all polyketides in a library. 

DETAILED DESCRIPTION OF THE INVENTION 



Because polyketides synthesized by modular PKS genes are built by the enzymatically 
controlled addition of primarily 2-carbon unit monomers and, to a lesser extent, other more 
complex monomers, each polyketide may be represented as a string of 2-carbon unit and other 
monomers. Each monomer represents the portion of the polyketide backbone structure that 
results from the incorporation of various starter and extender units (malonate, methylmalonate, etc.) 
and the subsequent chemical reactions. 

These reactions include: 

(1) condensation reactions, of which there are three basic reactions: malonyl-CoA 
condensation and methylmalonyl-CoA condensation with the branched methyl having either R 
or S stereochemistry; and 

(2) reduction reactions, of which there are five basic reactions: no reduction (ketone 
preserved), keto-reduction (to yield a hydroxyl having either R or S stereochemistry), 
dehydration (trans double bond), and enoyl-reduction (to yield a methylene). 

An illustrative set of the basic monomers that can be used to represent a polyketide 
structure (and their corresponding symbols) comprises: 

[Structure drawings of the basic monomers A-N with their symbols; see Figure 2] 

A miscellaneous monomer, Q, can be used to denote a portion of the polyketide structure that 
cannot be assigned by monomers A-N. 

The monomer set shown above and in Figure 2 does not represent the actual monomers 
incorporated during biosynthesis. Instead, each of these monomers includes one carbon from each 
of two different biosynthetic units. This is best explained using the polyketide fragment depicted below. 




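[Structure drawing of a polyketide fragment bearing two methyl branches, with the backbone carbons labeled alpha and beta by biosynthetic unit] 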

The fragment includes two two-carbon units, i and i+1, and part of a third two-carbon unit, i-1, 
that were incorporated into the polyketide during biosynthesis. The i-th extender module 
attaches the two-carbon biosynthetic unit whose backbone carbons are designated as alpha_i and 
beta_i, and the second extender module attaches the two-carbon biosynthetic unit whose backbone 
carbons are designated as alpha_(i+1) and beta_(i+1). Using the monomer set shown above, this 
fragment consists of monomer A (derived from the beta carbon added in module i-1 and the 
alpha carbon added in module i) and another monomer A (derived from the beta carbon added in 
module i and the alpha carbon added in module i+1). 




[Structure drawing of the same fragment with each of the two monomer A assignments bracketed] 

The fifth carbon, designated beta_(i+1), remains unassigned and will depend on the identity of the 
two-carbon biosynthetic unit that is incorporated in the polyketide by module i+2. 

The set of monomers shown in Figure 2 can be expanded to include other starter and 
extender units, of which there are many. Such starter and extender units include, for example 
but without limitation, hydroxymalonate (e.g., niddamycin), methoxymalonate (e.g., FK-520), 
ethylmalonate (e.g., FK-520), amino acids or amino acid derivatives that are incorporated into 
polyketides by the action of a non-ribosomal peptide synthase (e.g., thiazole in epothilone and 
pipecolate in rapamycin), or other units incorporated by, for example, an AMP ligase (e.g., the 
dihydroxycyclohexyl moiety in rapamycin, FK-506, and FK-520) or a soluble CoA ligase. An 
illustrative set of additional starter and extender units includes: 




[Structure drawings of the additional monomers, including J', K', and M', each bearing a substituent R] 

where R can be anything other than hydrogen or methyl (e.g., allyl, butyl, ethyl, hexyl, hydroxyl, 
isobutyl, and methoxy). 

The set of monomers can also include post-PKS modifications, such as hydroxylation, 
methylation, epoxidation, glycosylation, or addition of intra-macrocyclic fused rings making the 
system polycyclic. Also, a variety of methods are known for the incorporation of unusual starter 
and/or extender units in polyketide synthases (see, e.g., PCT Publication Nos. WO 97/02358; 
WO 99/03986; WO 98/01546; and WO 98/01571, each of which is incorporated herein by 
reference), and the monomer set can include such units. 

By viewing polyketides as composed of sets of distinct monomers, one can in 
accordance with the present invention define a polyketide as a string of alpha-numeric symbols 
to facilitate computer analysis. In one method, a modified CHUCKLES methodology for 
representing polyketides is used. The CHUCKLES methodology (see Siani et al., 
"CHUCKLES: a method for representing and searching peptide and peptoid sequence," J. 
Chem. Inf. Sci. 34: 588-593 (1994), which is incorporated herein by reference) for representing 
peptides and related oligomers allows monomers to be strung together such that the molecular 
graph for the basic macrocycle can be generated from the string of monomers. 

For example, using the set of monomers comprising A-N described above, the 
erythromycin macrocycle or 6-dEB can be represented as ADGJDD. This string of 
alphanumeric symbols is also referred to as the CHUCKLES string. Figure 3 depicts the 
relationship between the CHUCKLES string, the SMILES string, and the actual molecular 
structure of 6-dEB. The CHUCKLES string for 6-dEB can be annotated to represent the 
structure of erythromycin A: A(1-lactone closure, 2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl). 
Thus, ring closure (cyclization) and post-synthetic modifications (glycosylation and 
hydroxylation), and non-standard units where applicable (there are none in 6-dEB and 
erythromycin) are entered between parentheses after each monomer. Another example is an 
annotated CHUCKLES string for epothilone B: ME(1-lactone-closure)M(epoxide)LJDG(2-methylation)E. 
As above, cyclization, post-synthetic modifications (epoxide formation), and 
non-standard units (methyl at C-4) are entered between parentheses after each monomer. 
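
An annotated string of this kind can be split back into monomers and their annotations with a short parser. The Python sketch below is illustrative only: the names `TOKEN`, `parse_chuckles`, and `unannotated` are hypothetical, a monomer symbol is assumed to be one capital letter optionally followed by a prime (as in J'), and a comma is assumed to separate multiple annotations inside one pair of parentheses.

```python
import re

# One monomer symbol (capital letter, optionally primed), optionally
# followed by a parenthesized annotation. Nested parentheses are not handled.
TOKEN = re.compile(r"\s*([A-Z]'?)(?:\(([^)]*)\))?")


def parse_chuckles(annotated):
    """Split an annotated string into (symbol, [annotations]) pairs."""
    text = annotated.strip()
    result, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"cannot parse at position {pos}: {text[pos:]!r}")
        symbol, notes = m.group(1), m.group(2)
        # Assumption: commas separate several annotations on one monomer.
        annotations = [n.strip() for n in notes.split(",")] if notes else []
        result.append((symbol, annotations))
        pos = m.end()
    return result


def unannotated(annotated):
    """Recover the bare monomer string from an annotated one."""
    return "".join(symbol for symbol, _ in parse_chuckles(annotated))


epothilone_b = "ME(1-lactone-closure)M(epoxide)LJDG(2-methylation)E"
for symbol, notes in parse_chuckles(epothilone_b):
    print(symbol, notes)
print(unannotated(epothilone_b))
```

Keeping the annotations as a list per monomer makes it straightforward to search either on the bare macrocycle or on specific modifications.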

In another aspect of the present invention, a database of polyketides is provided. In one 
aspect of the present invention, the polyketides are represented by a string of defined monomers. 
In one embodiment, the monomers are selected from a group consisting of: 

In another embodiment, polyketides are represented by the monomers A-N as well as 
additional monomers selected from the group consisting of 



[Structure drawings of the additional monomers, including J', K', and M', each bearing a substituent R] 

where R can be anything other than hydrogen or methyl. 

The string of monomers can be represented as a linearized structure or as a string of 
symbols. For example, erythromycin can be represented as its aglycone, 6-dEB, as 

[Linearized structure drawing of 6-dEB] 

or as a string of symbols, ADGJDD. Optionally, the string of symbols can be annotated as 
"A(1-lactone closure, 2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl)" to more fully capture 
the erythromycin structure. This set of annotated strings is referred to as a "coded library" or a 
"coded" database of the present invention. 

In an illustrative embodiment, the polyketide database consists of the polyketides 
described in current literature (Journal of Antibiotics (1981-present), Journal of Natural 
Products) and various databases (Chemical Abstracts CAPlus, AntiBase). All unique 
macrocyclic polyketides are converted to the modified CHUCKLES format. Of the ~1000 novel 
polyketides obtained, only ~200 different strings of monomers and unique macrocycles are 
needed to represent the much larger collection of polyketides in the database, because many of 
the differences between the naturally-occurring polyketides are due to different glycosyl (sugar) 
groups attached at different positions on the macrocycle. 
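
The reduction from many polyketides to far fewer unique macrocycle strings amounts to grouping entries by their unannotated string. A minimal Python sketch follows; the function names and the three entries are hypothetical and chosen only to illustrate the grouping, not drawn from the database described here.

```python
import re
from collections import defaultdict


def strip_annotations(annotated):
    """Drop parenthesized annotations and whitespace, keeping monomer symbols."""
    return re.sub(r"\([^)]*\)|\s+", "", annotated)


def group_by_macrocycle(named_strings):
    """Map each unannotated string to the names of the entries sharing it."""
    groups = defaultdict(list)
    for name, annotated in named_strings:
        groups[strip_annotations(annotated)].append(name)
    return dict(groups)


# Hypothetical entries, for illustration only.
entries = [
    ("compound-1", "A(1-lactone closure)DGJDD"),
    ("compound-2", "A(1-lactone closure)DGJD(1-glycosyl)D(1-glycosyl)"),
    ("compound-3", "ME(1-lactone-closure)MLJDGE"),
]

for macrocycle, names in group_by_macrocycle(entries).items():
    print(macrocycle, names)
```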

Thus, a macrocyclic polyketide can be converted to a string of 2-carbon monomers by 
mapping the monomers onto the polyketide. This can be performed manually or with computer 
assistance. First, any sugar moieties are conceptually removed by hydrolysis and any lactones 
(bond between the ketone and oxygen) are hydrolyzed thus generating a linearized structure of 
the backbone of the polyketide. Generally, this leaves a carboxy carbon at one end of the linear 
molecule and a hydroxyl at the other. The polyketide is then "sequenced" manually or in silico 
from the end containing the carboxy carbon, the end corresponding to the last monomer added 
by the PKS before synthesis is complete. This end serves as a convenient handle from which to 
start the mapping process. Although closing of the lactone often occurs between the two ends of 
the polyketide, this is not always the case. However, the last ketone added by the PKS is almost 
always involved in macrolactone formation and so serves as a more convenient handle than the 
hydroxyl for commencing sequencing. 

The manual or in silico sequencing is performed by matching the monomers, one at a 
time, while traversing the macrocyclic backbone. First the carboxy carbon is skipped, and an 
attempt is made to match each of the monomers in the monomer set selected (i.e., monomer set 
A-N in Figure 2) against the next two carbons in the macrocycle. The match takes into account 
carbon, oxygen, and no substitution at each backbone position, chirality at each backbone 
position, and bond order between the two backbone carbons. 

If the sequencing is performed in silico, the method is referred to as back-translation and 
involves converting a molecular graph into a string of monomers. First, the monomer library is 
converted to SMARTS format. SMARTS is a superset of the SMILES language that specifies a 
pattern in a molecular graph (Daylight Software Manual: Theory; Daylight Chemical 
Information Systems; Irvine, CA 1993, incorporated herein by reference). SMARTS permits 
one to specify a variable number or a limit on the number of covalent bonds to non-hydrogen 
atoms from a particular atom. In contrast, SMILES assumes that the unspecified valences are 
hydrogens. For example, the SMILES string for monomer A is [C@@H](O)[C@H](C). The 
oxygen may be bonded to any other single atom; if the atom is not specified, it is assumed to be a 
hydrogen. In the SMARTS string for monomer A, [C@@H]([O;D2])[C@H]([CH3]), one can 
specify the exact number of hydrogens on some atoms (e.g., "CH3"). In addition, the "[O;D2]" 
indicates the oxygen is bonded to two (from D2) non-hydrogen atoms, in this case the first 
carbon and some other unspecified atom. This allows matching and distinction of post- 
modification moieties attached to the oxygen as well as additional cyclizations (six member 
rings can occur within the macrocycle; e.g., rapamycin). Thus, the SMARTS notation allows 
pattern matching against the polyketide molecular graph. 

When a match occurs, the atoms that match are tagged as part of a superset and labeled 
with the monomer name. Any atoms that are connected to the monomer that are not part of the 
macrocycle are tagged for identification as special precursor units (e.g., ethylmalonate instead of 
methyl malonate or malonate), or post-synthetic modification moieties (e.g., sugars, CCHO, 
hydroxylation, methylation). If all the atoms and bonds of the monomer cannot be identified, 
the monomer is given a designation to indicate the lack of identification (e.g., Q for question 
mark). These Q monomers can be used to identify monomers that are the site of post-PKS 
modifications that mask the function of the PKS module that generated that portion of the 
polyketide or that are not in the monomer set and so prevent the correlation of a particular 
segment of the backbone with one of the monomers in the monomer set. 

After a particular 2-carbon unit is identified, the next two carbons are processed the same 
way. This is repeated until all the backbone carbons are identified and labeled as monomers. 
When all two-carbon units are identified, one has generated an ordered sequence, or string, of 
monomers, which is a modified CHUCKLES string of the invention. Moieties corresponding to 
post-PKS modifications are appended to the monomer in the string as an annotation in 
parentheses. This method of sequencing may be extended to include any type of monomer. 
Figure 5 shows a flow chart of this matching method for the generation of the CHUCKLES 
strings used for all polyketides in a library. 
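
The two-carbons-at-a-time walk described above can be illustrated with a toy model in Python. This sketch does not perform SMARTS matching on a molecular graph: the backbone is reduced to a list of per-carbon substituent labels, chirality and bond order are ignored, and the library symbols m1 to m3 are hypothetical placeholders rather than monomers from Figure 2. Only the Q designation for unmatched units is taken from the text.

```python
def back_translate(backbone, library, unknown="Q"):
    """Walk a linearized backbone two carbons at a time from the carboxy end.

    backbone: list of per-carbon substituent labels, carboxy carbon first.
    library:  dict mapping (label, label) pairs, in traversal order,
              to monomer symbols.
    """
    carbons = backbone[1:]                 # the carboxy carbon is skipped
    symbols = []
    for i in range(0, len(carbons), 2):
        pair = tuple(carbons[i:i + 2])     # a trailing single carbon gives a 1-tuple
        symbols.append(library.get(pair, unknown))
    return symbols


# Hypothetical toy library and backbone, for illustration only.
toy_library = {
    ("CH3", "OH"): "m1",
    ("H", "=O"): "m2",
    ("H", "H"): "m3",
}
toy_backbone = ["COOH", "CH3", "OH", "H", "=O", "CH3", "NH2"]

print(back_translate(toy_backbone, toy_library))
```

The result is listed from the carboxy end, that is, in the reverse of the biosynthetic order; reversing the list gives the other orientation.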

The CHUCKLES string can be in the order corresponding to the direction of 
biosynthesis on the PKS or its reverse. Each CHUCKLES string has a one-to-many relationship 
with the PKS gene in the producing organism. Thus, while many different organisms can 
produce the same polyketide using the same or different PKS genes, each PKS gene generally 
produces only one PKS that produces only one polyketide (some AT domains can bind different 
CoAs, leading to the production of multiple polyketides from a single PKS). This allows one to 
design, from the polyketide structure, a set of PKS genes that would produce that polyketide. 

Thus, the present invention provides methods and computational analysis tools for 
designing PKS genes to produce a desired polyketide. As an illustrative example, the present 
invention provides a computer program termed MORPH (see the Examples below) that can read 
the coded library (see the Examples below). An illustrative coded library consists of ~200 
unique polyketide CHUCKLES strings. The user specifies the target polyketide, which is 
converted from molecular structure to a CHUCKLES string. 

The program then performs the following, starting with each library compound or string: 

(1) aligns library compound and target compound, emphasizing alignment of adjacent 
monomers common between the two; 

(2) fills in the gaps using all possible combinations from all library members; 

(3) counts the number of non-natural inter-modular boundaries; and 

(4) outputs all these alignments. 

The alignments are then sorted based on the number of non-natural inter-modular 
boundaries. 
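
The counting and sorting steps can be sketched in Python as follows. This is not the MORPH program itself: the representation of a design as a list of (source PKS, module number) pairs, the names `non_natural_boundaries` and `rank_designs`, and the example sources pksX and pksY are all hypothetical. A junction is treated as natural only when both modules come from the same source and are consecutive there, which is an assumption of this sketch.

```python
def non_natural_boundaries(design):
    """Count junctions between adjacent modules that do not occur in a source PKS.

    design: list of (source_name, module_number) tuples in assembly order.
    """
    count = 0
    for (src_a, mod_a), (src_b, mod_b) in zip(design, design[1:]):
        # Natural junction: same source, consecutive module numbers.
        if src_a != src_b or mod_b != mod_a + 1:
            count += 1
    return count


def rank_designs(designs):
    """Sort candidate designs so the fewest non-natural junctions come first."""
    return sorted(designs, key=non_natural_boundaries)


# Hypothetical candidate designs built from two made-up source PKSs.
designs = [
    [("pksX", 1), ("pksY", 2), ("pksX", 3), ("pksY", 4)],
    [("pksX", 1), ("pksX", 2), ("pksY", 3), ("pksY", 4)],
    [("pksX", 1), ("pksX", 2), ("pksX", 3), ("pksY", 1)],
]

for design in rank_designs(designs):
    print(non_natural_boundaries(design), design)
```

Because Python's sort is stable, designs with equal counts keep their input order; a secondary key such as the number of non-native protein interfaces could be added to break ties.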

This illustrative program allows one to design and find PKS genes that encode PKS 
enzymes that are combinations of two or more different PKS enzymes with the fewest inter- 
modular boundaries, and optionally the fewest inter-protein boundaries. Many other alternative 
embodiments are provided by the present invention. 

For example, one can include the naturally occurring PKS that produces the target
polyketide in the coded library to allow components of that PKS to be incorporated into the
design of a new PKS. One can also include in the coded library non-naturally occurring PKS
enzymes, such as those produced to make novel polyketides and published in the scientific and
patent literature. See, e.g., PCT Publication Nos. WO 98/49315 and WO
96/40968, both of which are incorporated herein by reference.

This CHUCKLES-coded polyketide library can be stored in a computer file as a set of 
records. In one embodiment, each record contains the chemical name of the polyketide, the 
unannotated CHUCKLES (containing basic macrocyclic monomers), the annotated 
CHUCKLES (containing basic macrocyclic monomers with information about post-PKS 
modifications), the producing organism(s), and other information (e.g., linearized representation
of the polyketide structure, the accession number of organisms or plasmids that have been
deposited, gene sequence information, and references).
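
By way of illustration only, a library record of the kind described above could be held in
memory as follows. This is a minimal Python sketch; the class name `LibraryRecord` and its
field names are assumptions made for this example and are not taken from the MORPH source code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LibraryRecord:
    """One polyketide entry of the coded library (hypothetical field names)."""
    name: str                                  # chemical name of the polyketide
    chuckles: str                              # unannotated CHUCKLES string
    annotated_chuckles: Optional[str] = None   # CHUCKLES with post-PKS modifications
    organisms: List[str] = field(default_factory=list)   # producing organism(s)
    other: Dict[str, str] = field(default_factory=dict)  # accession numbers, references, etc.


# Example entry built from the erythromycin strings given in this document.
erythromycin = LibraryRecord(
    name="erythromycin",
    chuckles="ADGJDD",
    annotated_chuckles="A(2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl)",
)
```

Fields that are not complete for a given polyketide simply keep their empty defaults.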

The MORPH program can read the polyketide library entries into an array or list of data
structures, where each entry data structure contains all or a selected subset of the fields in each
library record. The MORPH program then reads in the CHUCKLES-coded TARGET
polyketide from the user. This TARGET may optionally be blocked from the library so that it is
not used as a STARTER, or left in the library, e.g., if it is only distantly related to other known
polyketides, or some modules could be useful in designing novel PKS genes, or it is desired to
replace only certain PKS modules. This program could also be used for analoging at a particular
position via wild-cards defined as part of the TARGET sequence by the user.

Each member of the coded PKS library can be selected as a STARTER unit. Thus,
during a run, all library members can be given an equal chance as STARTER units. After a
STARTER is chosen, the TARGET is aligned with it. See Flowchart Figure 4D. Any method
of alignment can be used such that the maximal number of adjacent STARTER modules is used
in the final alignment. After the maximal adjacent modules are used in the ALIGNMENT,
smaller adjacent sets or individual modules from the STARTER are used to fill in the gaps.
There may be several alignments that are equally good based on the attempt to optimize the
number of adjacent modules. For example, if the TARGET contains the "JDG" substring, then
6-dEB, identified with the A1D2G3J4D5D6 CHUCKLES string, may align as

TARGET    JDG
6-dEB     J4D2G3

or

TARGET    JDG
6-dEB     J4D5G3.

These two alignments use different maximal sets of adjacent modules, each of
length two (D2G3 in the first and J4D5 in the second). Accordingly, either
alignment could be used as the STARTER alignment.
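
The search for maximal sets of adjacent modules can be sketched as a longest-common-run search
between the TARGET and STARTER strings. The following Python sketch shows one workable
approach only, since any method of alignment can be used; the function name
`maximal_adjacent_runs` is hypothetical, and indices are 0-based, so that starter index 3
corresponds to module J4.

```python
def maximal_adjacent_runs(target: str, starter: str):
    """Return (length, placements) for the longest runs of adjacent monomers
    shared by target and starter. Each placement is a pair
    (target_index, starter_index) marking where a run of that length begins."""
    best_len = 0
    placements = []
    for i in range(len(target)):
        for j in range(len(starter)):
            # Extend the match for as long as the monomers agree.
            k = 0
            while (i + k < len(target) and j + k < len(starter)
                   and target[i + k] == starter[j + k]):
                k += 1
            if k > best_len:
                best_len, placements = k, [(i, j)]
            elif k == best_len and k > 0:
                placements.append((i, j))
    return best_len, placements


length, placements = maximal_adjacent_runs("JDG", "ADGJDD")
```

For the TARGET JDG and the 6-dEB string ADGJDD, the sketch finds a run length of two with two
equally good placements, corresponding to J4D5 and D2G3 in the alignments shown above.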

With the optimized alignment from the STARTER, other library entries are
systematically used to complete the alignment, or fill in the gaps. This part may be performed
on either the optimized ALIGNMENT described above, or the ALIGNMENT without the single
modules from the STARTER; the removal of the individual modules opens up more space into
which larger pieces of the FILLER might be placed. The first library entry is designated as the
FILLER. If the FILLER is the same as the STARTER, the next library entry is used as the
FILLER. This library entry is flagged as the CURRENT_FILLER_LIBRARY_ENTRY. The
same method for finding maximally adjacent modules and then smaller sets or single modules is
used to fill the gaps in ALIGNMENT from the FILLER. If not all the gaps are filled in the
ALIGNMENT, then the next library entry is used as a new source; that is, it is designated as the
FILLER, and the gaps are filled further. This is repeated until the ALIGNMENT is complete or
the end of the library is reached.
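
The gap-filling pass described above can be sketched as a greedy procedure that tries the
longest runs of adjacent FILLER modules first. This is a simplified Python illustration and not
the MORPH implementation; the names `fill_gaps` and `complete_alignment`, and the
representation of the ALIGNMENT as a list of (library entry, module index) slots, are
assumptions made for the example.

```python
from typing import List, Optional, Tuple

Slot = Optional[Tuple[str, int]]  # (library entry name, 0-based module index) or None for a gap


def fill_gaps(target: str, alignment: List[Slot], name: str, filler: str) -> None:
    """Fill empty slots of `alignment` in place with modules from `filler`,
    trying long runs of adjacent modules before shorter ones."""
    for run in range(len(target), 0, -1):
        for i in range(len(target) - run + 1):
            if any(alignment[i + k] is not None for k in range(run)):
                continue  # window overlaps a slot that is already filled
            j = filler.find(target[i:i + run])
            if j >= 0:
                for k in range(run):
                    alignment[i + k] = (name, j + k)


def complete_alignment(target: str, alignment: List[Slot], fillers) -> bool:
    """Apply fillers in library order until no gap remains or the library ends."""
    for name, sequence in fillers:
        fill_gaps(target, alignment, name, sequence)
        if all(slot is not None for slot in alignment):
            return True
    return False


# TARGET MEMLJDGE with J and D already taken from erythromycin (ADGJDD) as STARTER.
target = "MEMLJDGE"
alignment: List[Slot] = [None] * len(target)
alignment[4], alignment[5] = ("erythromycin", 3), ("erythromycin", 4)
done = complete_alignment(
    target, alignment, [("tedanolide", "JGEHMDHF"), ("aldgamycin", "BMLGJDL")]
)
```

In this run tedanolide supplies the adjacent pair GE and the single modules M and E, and
aldgamycin supplies the remaining L.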

Assuming all modules in the TARGET are represented in the library, the ALIGNMENT
is eventually completely filled. The completed alignment is then written to an output file on the 
computer disk. When the ALIGNMENT is complete, or there are no more FILLERS in the 
library, the TARGET and STARTER alignment are re-copied to ALIGNMENT. The 
CURRENT_FILLER_LIBRARY_ENTRY is incremented, and a new attempt to fill in the gaps 
is started. 

When the CURRENT_FILLER_LIBRARY_ENTRY has reached the end of the library,
the ALIGNMENT is wiped, and a new STARTER is chosen. The above process is then
repeated for the next STARTER. When all library entries have been used as starters, then all
feasible novel polyketide synthases have been generated and written to the computer file. The
novel PKSs are then read back into memory and can be further evaluated. An illustrative 
evaluation process involves: 

(1) counting the non-native inter-module interfaces, and 

(2) counting the number of native inter-protein interfaces (for known and 
annotated gene sequences). 

The novel PKSs are then sorted based on these two numbers, giving higher priority to the non- 
native inter-module interfaces. In this mode, the goal is to identify those novel PKSs that contain 
the fewest non-native interfaces. 
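
The evaluation step can be sketched as follows. This Python example assumes that each design is
a list of (source polyketide, module index) pairs and that the inter-protein interface count,
which depends on annotated gene sequences, is supplied separately; the names
`non_native_interfaces` and `rank` are hypothetical, and the hybrid design shown is made up for
illustration.

```python
from typing import List, Tuple

Module = Tuple[str, int]  # (source polyketide, 0-based module index in that PKS)


def non_native_interfaces(design: List[Module]) -> int:
    """Count junctions whose two modules are not neighbours in a single source PKS."""
    count = 0
    for (src_a, idx_a), (src_b, idx_b) in zip(design, design[1:]):
        if src_a != src_b or idx_b != idx_a + 1:
            count += 1
    return count


def rank(designs: List[List[Module]], inter_protein_counts: List[int]) -> List[List[Module]]:
    """Sort designs by non-native inter-module interfaces first, then by the
    separately supplied inter-protein interface counts."""
    keyed = zip(designs, inter_protein_counts)
    ordered = sorted(keyed, key=lambda pair: (non_native_interfaces(pair[0]), pair[1]))
    return [design for design, _ in ordered]


native = [("erythromycin", i) for i in range(6)]
hybrid = [("amphotericinA", 14), ("aldgamycin", 5), ("pikromycin", 2),
          ("pikromycin", 3), ("pikromycin", 4), ("amphotericinA", 1)]
best_first = rank([hybrid, native], [0, 0])
```

Because the sort key is a tuple, the inter-module count takes priority and the second number
only breaks ties.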

By providing methods and means for the computer-assisted analysis of polyketides and 
PKS genes, the present invention greatly facilitates the identification and production of new 
polyketides with useful activities. Those of skill in the art will appreciate that while the 
invention is in part illustrated in the Examples below with respect to the design of new PKS
genes for known polyketides, the invention can also be used to design PKS genes for novel
polyketides. In this embodiment, one simply provides the structure of the novel polyketide to
the MORPH or other program of the invention to generate the desired PKS genes.

Moreover, while the invention is exemplified below by designing new PKS genes 
composed of the coding sequences for one or more complete modules of two or more different 
PKS genes, partial modules can also be employed. With the appropriate choice of monomer sets 
and corresponding coding of the library to be searched, one can generate new PKS gene designs 
that take advantage of the potential to fuse one PKS gene coding sequence to another at a site 
corresponding to an intra-modular junction. In another embodiment, one can use "wild-cards"
in the encoded polyketide or library to take advantage of known or predicted structure-activity
relationships (SAR). Thus, if one
knows that a particular position in a polyketide can be varied (e.g., a hydrogen, methyl, or ethyl
group at a location determined by an AT domain of a particular module, or a hydroxyl or keto
group at a location determined by the presence or absence of a KR domain in a particular
module) then one can use a wild-card monomer designation in the polyketide CHUCKLES
string to generate PKS genes that produce each of the desired variants.
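
Wild-card expansion can be illustrated with a short Python sketch that enumerates every concrete
target string implied by a wild-card target. The function name `expand_wildcards` is
hypothetical, and the sketch assumes that the wild-card letters (X, Y, Z here) are not themselves
used as monomer codes.

```python
from itertools import product
from typing import Dict, List


def expand_wildcards(target: str, wildcards: Dict[str, str]) -> List[str]:
    """Replace each wild-card letter in `target` by every monomer in its set."""
    # A non-wild-card monomer contributes itself as its only choice.
    choices = [wildcards.get(monomer, monomer) for monomer in target]
    return ["".join(combo) for combo in product(*choices)]


variants = expand_wildcards("MEXLJDGE", {"X": "ABCD"})
```

Here `variants` holds the four concrete targets MEALJDGE, MEBLJDGE, MECLJDGE, and MEDLJDGE, each
of which could then be processed as an ordinary TARGET.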

The methods of the invention have diverse application in addition to the design of new 
PKS genes. As but one illustrative example, the methods of the invention can be used to design 
methods to produce a desired compound. Organic molecules containing stereochemical centers 
are useful for a number of purposes, including use as synthetic or semi-synthetic intermediates. 
The preparation of such intermediates by organic synthesis can be extremely time consuming 
and expensive. An alternative source of such intermediates is via specific degradation of a 
polyketide, and the present invention provides computer-assisted means for designing such 
production methods. 

Thus, certain functional groups of polyketides are susceptible to bond cleavage by 
specific chemical reactions that do not affect other functional groups. For example, carbon- 
carbon double bonds can be specifically cleaved by permanganate without affecting other 
functional groups normally in polyketides, such as ketones, alcohols, and lactones. Likewise, 
the Baeyer-Villiger reaction converts a ketone to an ester (lactone) without affecting other
groups of the aglycone. In accordance with the methods of the invention, one can assemble a
library of polyketides in a database that can be addressed with a query describing a particular
chemical reaction to generate all of the degradation products produced by that reaction upon
each of the polyketides in the library. The degradation fragments thus generated serve as a
library of the invention that can be sorted by properties, such as size, number and type of
stereochemical centers, functional groups, or other factors, and searched for useful compounds.
Moreover, the functional groups on the ends of the fragments generated (or at other locations) 
can also be converted to other functional groups by chemical reactions (optionally employing 
protecting groups on other functional groups), and the database of compounds can be expanded 
to include the compounds produced by such reactions. 

From even a modest library of ~200 compounds, one can in this manner generate, using
the methods of the invention, two to three times as many valuable chemical intermediates. Once
such an intermediate is identified, the organism that produces the polyketide from which the
fragment is derived is fermented, the polyketide isolated in bulk, the chemical reaction
performed, and the desired degradation product(s) isolated and used. In this manner, the present
invention makes available a wide variety of useful products otherwise unattainable.

Thus, the present invention has wide application in the fields of chemistry, particularly 
medicinal chemistry, molecular biology, and medicine. Those of skill in the art will recognize 
these and other benefits and applications provided by the present invention. Thus, the following 
examples are given for the purpose of illustrating the present invention and shall not be 
construed as being a limitation on the scope of the invention or claims. 

EXAMPLE 1 

The MORPH Program 

This example provides the source code for an illustrative MORPH program of the 
invention. The MORPH program is a command line driven program that runs on a UNIX 
system. The program can be run from a shell script, such that the user fills in the entire 
command ahead of time, then post-processes the output file with UNIX utilities including sort, 
egrep, and uniq. 

The command line appears as follows: 

morph3 -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards].

The library file is the name of the text file described below in Example 2. The target
name is a user-defined identifier to distinguish this target from the library members (e.g.,
epothiloneD). The target sequence is a string of monomers that represent the CHUCKLES-encoded
target polyketide (e.g., MEMLJDGE). Generally, if the target sequence is in the
library, it is commented out from the library so that the morph program does not find the target
itself. The three different wildcards, X, Y, and Z, are independent sets of monomers that can be
included in the target sequence.

The output from the morph program can be redirected to a file. This output file is then 
post-processed by (1) extracting the HIT lines with valid combinations of modules that yield the 
target, (2) sorting the HITS based on alphanumeric content using the UNIX sort command, (3) 
running the UNIX uniq command which removes multiple copies of each HIT, leaving one copy 
of each, and (4) sorting based on the number of pieces in the sequence of modules. Generally, the
fewest number of pieces, which correspond to the fewest number of inter-modular interfaces, are
desired; these will appear at the top of the output.

Below are some illustrative examples of calls to the MORPH program from a shell script
using epothilone as a target. The first example generates combinations that yield epothilone D:

%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD

%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort

The second example generates combinations that yield a derivative of epothilone D having a
hydroxyl at C-13:

%morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGE -x ABCD > oepoD-13OH
%egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort

The third example generates an epothilone having wildcards (set 1):

%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCD -y LEFIN -z JACGM > oepoD-set1
%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort

The fourth example generates an epothilone having another set of wildcards (set 2):

%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2
%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort
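
For readers without the historical `sort +10 -11` key syntax available, the post-processing
pipeline can be approximated in Python. This sketch assumes only what the commands above imply:
HIT lines are whitespace-separated and the eleventh field is the sort key. The exact layout of a
MORPH output line is not given here, so the function name `postprocess` and the sample lines in
any usage are hypothetical.

```python
from typing import Iterable, List


def postprocess(lines: Iterable[str], key_field: int = 10) -> List[str]:
    """Approximate `egrep HIT | sort | uniq | sort +10 -11`.

    Keeps lines containing "HIT", removes duplicates, sorts them
    alphanumerically, then re-sorts on the field at 0-based index `key_field`."""
    hits = sorted({line.rstrip("\n") for line in lines if "HIT" in line})

    def key(line: str) -> str:
        fields = line.split()
        # Compared as text, as sort does without -n; lines lacking the field sort first.
        return fields[key_field] if len(fields) > key_field else ""

    # sorted() is stable, so ties keep the alphanumeric order established above.
    return sorted(hits, key=key)
```

A call such as `postprocess(open("omorph3_epoD"))` would then play the role of the second command
in each example above.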

MORPH in its current implementation operates at the monomer level and thus does not 
handle intra-modular modifications/splitting. Future implementations could convert the
CHUCKLES-encoded strings into the corresponding and equivalent SMILES and then perform
more complex chemical analysis of the PKS molecular graphs. Currently, inter-modular double
bonds are present in the library, but are ignored by the program. These bonds can be introduced
post-biosynthetically and the exact source is generally unknown.

The source code for MORPH is found in Appendix A (version 3.0) and B (version 4.0) 
(deposited in the microfiche appendix). 



EXAMPLE 2 

Illustrative Polyketide Library

This example provides the contents of an illustrative CHUCKLES-encoded polyketide
library. The first column provides the name of the polyketide; the second the CHUCKLES
string; the third the annotated CHUCKLES string; and the fourth the source organism. Entries
under annotated CHUCKLES and source organism are not complete for all of the polyketides.

POLYKETIDE | CHUCKLES | annotated-CHUCKLES | SOURCE ORGANISM
3-acetyl-4"-butyryltylosin | FMNGODF | |
#aculeximycin | RKNRSMRSRSSSSRLSSN | RR(2-ethyl)NRSMRS(1-glycosyl)RSSS(2-hydroxyl)SR(1-glycosyl)LSSN(2-ethyl) |
albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077 | BLME=JN | BLM(1,2-epoxy)E(1-methoxy)=J(2-hydroxyl)N |
albocycline-M2 | BLME=JN | B(2-hydroxyl)LME(1-methoxy)=J(2-hydroxyl)N |
albocycline-M3 | BSME=JL | BSME(1-methoxy)=J(2-hydroxyl)L |
albocycline-M5 | BSME=JN | BSME(1-methoxy)=J(2-hydroxyl)N |
albocycline-M6 | BLME=JL | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)L |
albocycline-M7 | BLME=JN | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)N |
albocycline-M8 | BLME=Q | BLME(1-methoxy)=Q |
aldgamycin | BMLGJDL | B(1-cyc)MLG(2-hydroxyl)JDL(2-cyc) |
amphotericinA | CDNNLNNNNFCEFEALEE | CDNNLNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EF(1-cyc)EALEE |
#amphotericinB | CDNNNNNNNFQQQEELEE | CDNNNNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EF(1-cyc)EALEE |
angiolain | NMFJNSJIQLGAm | |
aplyronineA | BFJCENFFMEKAFNN | B(1-C(=O)C)F(1-C(=O)C(C)N(C)C)JCENFFME(1-methoxy)KAF(1-C(=O)C(N(C)C)COC)NN | sea hare Aplysia kurodai
apoptolidin | EFLMNAMMM | EF(methoxy-1,hydroxy-2)LMNAMMM |
aurachinB | MLMLM | MLMLM |
aurachinC | MLMLM | MLMLM |
A59770 | QQKQQLJFCDN | QQK(2-ethyl)Q=QLJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | Amycolatopsis orientalis
A82548A-cytovaricin | QQQKQNLJEDDN | |
A83543A | PLFQQQ | FLFQQQ | Saccharopolyspora spinosa
AB023a | NNNNNRSSSLS | NNNNNRSSSLSR |
AH-758 | RSNMURMN | RS(2-methoxy)NMURMN(2-methoxy) |
bafilomycinD | BNHCENMKCMN | BNHCE(1-macrocyc,2-methoxy)NMKCMN(2-methoxy)Q(keto-macrocyc) | S. sp.
bafilomycinA1 | BNHCENMKCMN | BNHCE(1-macrocyc,2-methoxy)NMKCMN(2-methoxy)Q(keto-macrocyc) |


bonelidin | SNMRUUUS | SNM(2-N)RUUUS |
calyculin | QFBBMmNm | QF(1-methoxy)BBMmNm | Discodermia calyx
candicidin-candeptin-ascosin-levorin-etc | OTNNWNNNNFCEEUEE | CDNNNNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EE(1-cyc)E(2-hydroxyl)LIEE | S. griseus, S. canescus, S. levoris, S. viridoflavus, Stv. grisoviridum
candidin | QDNNNNNNNFCEFEUEE | RDNNKNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EF(1-cyc)E(2-hydroxyl)LIEE |
carbomycin | ENNCODF | EN(1,2-epoxy)NHOF(1-glycosyl,2-methoxy)F(1-C(=O)C) |
carbomycinB-magnamycinB | FNNGOFF | FNNGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) |
carbomycin-A-magnamycin-deltamycinA4-NSC51001-PS97-3628-WC3628 | ENNHOFF | ENNHO(includes CCHO)FF |
chalcomycin-myconomycin-aldgamycinD-mikonomycin | BNNGKDN | B(2-glycosyl)N(1,2-epoxy)NG(2-hydroxyl)KD(1-glycosyl)N | S. bikiniensis, S. albogriseolus
chimeramycinB-PTL448 | BMNCODF | B(0-ethyl)MNC(1-glycosyl)OD(1-glycosyl)F | S. ambofaciens ka-448
chivosazolA | SRRnNNSRQNnRMnNn | SRR(1-O-macrocyc)nNNSR(1-methoxy)QNnR(1-glycosyl)MnNnQ(keto-macrocyc) | S. cellulosum
cineromycinB | BLME=JN | BLME=J(2-hydroxyl)N | S. cinereochromogenes, S. sp.
cineromycinBdehydro | BLMI=JN | BLMI=J(2-hydroxyl)N | S. griseoviridis
cineromycinB2,3dihydro | BLME=JL | BLMEF=J(2-hydroxyl)L | S. griseoviridis
cirramycinB1dihydroxy-A6888X | BMNGODF | BM(1,2-epoxy)NGOD(1-glycosyl)F | S. flocculus
cirramycinB-cirramycinB1-Acumycin-A688A-B58941-A6888A | AMNGODF | AM(1,2-epoxy)NGOD(1-glycosyl)F | S. cirratius, S. griseoflavus, S. fradiae, S. flocculus
cladospolideA | ELLFN | ELLF(2-hydroxyl)N | Cladosporium fulvum, C. cladosporiodes
cladospolideB | ELLFn | ELLF(2-hydroxyl)n | Cladosporium fulvum, C. cladosporiodes
cladospolideC | ELLEN | ELLE(2-hydroxyl)N | fungus Cladosporium tenuissimum


concanamycinA-folimycin-A661-1-S45A-TAN1323B-X4357B | CENMKCCMN | CE(2-methoxy)NMKC(2-ethyl)CMN(2-methoxy) | S. diastatochromogenes, S. sp., S. neyagawaensis
concanamycinB-S45B | CENMKCCMN | CE(2-methoxy)NMKCCMN(2-methoxy) | S. diastatochromogenes
concanamycinG-anhydroconcanamycinB | NBNHCENMKCCMN | NBNHCE(2-methoxy)NMKCCMN(2-methoxy) |
conglobatin | AJM | A(1-oxazolyl)JM.AJM |
#copiamycin | RNRSSSSSSRLRSRN | RNRSSSS(1-O-cyc)S(2-hydroxyl)S(1-cyc)RLRSRN | S. hygroscopicus
cytovaricin-H230 | QQQKQNLJFCDN | QQQKQNLJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | S. sp., S. collinus
cytovaricinB | QQKQMJFDDN | QQQKQNLJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | S. torulosus
CP64537 | ARGKDD | A(2-glycosyl)R(1-C(=O)C(O)C(C)C)G(2-hydroxyl)KDD(1-glycosyl) | Streptomyces toyocaensis humicola ATCC 39491
damavaricinC | QDQCCNM | QDQCCNM | S. spectabilis
deltamycinA1 | ENNGOFF | EN(1,2-epoxy)NGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. deltae, S. halstedii-deltae
deltamycinX-desisovalerylcarbomycinA | ENNGOFF | EN(1,2-epoxy)NGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. deltae, S. halstedii-deltae
engleromycin | QNJHN | QNJH(2-hydroxyl)N(1,2-epoxy) | Engleromyces goetzei
#epothilone | MEMLJDgE | MEM(1,2-epoxy)LJDG(2-methyl)E |
erythromycin | ADGJDD | A(2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl) |
espinomycinA2 | ENNCOFF | ENNCOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. fungicidicus
filipinIII-lagosin14deoxy | 1ANNNNMEKMFFFFL | E(2-hydroxyl)NNNNMEl , 'FFFFFF | S. filipinensis, S. durhamensis
filipin-lagosin | ENNNNMEFFFFFFF | | S. filipinensis, S. durhamensis
formamicin | CBNMOCMN | CBNMO(includes a long, branched alkyl chain)CMN |
foromacidinB-spiramycinII-spiramycinB | ENNAOFF | | S. ambofaciens
FD891 | RRUSNNLNRMM | RRUS(1-O-macrocyc)NNL(2-hydroxyl)N(1,2-epoxy)RMMQ(keto-macrocyc) | S. graminofaciens
FK895 | RNRNMRNRLS | R(1-methoxy)N(1,2-epoxy)KNMR(1-O-macrocyc)NR(1-C(=O)C,2-hydroxyl)LSQ(keto-macrocyc) | S. hygroscopicus
FK-506 | MAEPMJJBKOO | |
gedamycin | JEJBNNNnNNNFAEFEEEEIE | IEJBNNNnNNNF(1-glycosyl)A(1-O-cyc,2-carboxylicacid)EF(2-cyc)EEEEIE | S. aureofaciens


geldanamycin | QULRMRnM | QUL(2-methoxy)RMS(1-CONH2,2-methoxy)nM | S. hygroscopicus var. geldanus
gephyronicacid | RRTSRRM | |
gloeosporone | ELLEIL | ELLE(1-O-cyc)I(2-cyc,2-hydroxyl)L | Colletotrichum gloeosporioides f. sp. jussiaea
GERI-155 | BNLGJDN | B(2-O-glycosyl)N(1,2-epoxy)LG(2-hydroxyl)JD(1-glycosyl)N | S. GERI-155
halomicin | QNCRCBNM | QNC(1-methoxy)R(1-CfOXXBNM | Mic. halophytica
herbimycin | ALDMFnM | A(1-methoxy)L(2-methoxy)D(1-methoxy)MF(1-CONH2,2-methoxy)nM |
hygrolidin | CENMJCMM | CE(1-O-macrocyc,2-methoxy)NMJCMMQ(keto-macrocyc) | S. hygroscopicus
hygrolidin-oxo | BNHCENMJCMN | BNHCE(1-O-macrocyc,2-methoxy)NMJCMN(2-methoxy)Q(keto-macrocyc) | S. griseus, S. hygroscopicus
mimamycin | URRNSLMQQS | UR(1-O-macrocyc)RNS(1-glycosyl)LMR(1-O-cyc)=QS(2-cyc)Q(keto-macrocyc) |
juvenimicinA1-T1124A1-M4365A1 | BMNGFDF | BMNGFDF(includes an ethyl in pos 2) | Mic. chalcea
juvenimicinA2-T1124A2-M4365A2 | BMNGJDF | | Mic. chalcea
juvenimicinA4-T1124A4-M4365A4 | BMNGODF | | Mic. chalcea-Mic. capillata
juvenimicin-T1124 | BNNGJDF | | Mic. chalcea
kanchanamycin | NMSSSSSSSRLRSRNN | |
lankamycin-kujimycin-landavamycin-A20338N2 | ADGJDD | | S. violaceoniger, S. spinichromogenes
leinamycin | QNNIMLN | |
leucanicidin | CENMKCMN | CE(2-methoxy)NMKCMN(2-methoxy) | S. halstedii
leucomycinA12-kitasomycinA12 | FNNCODF | | S. kitasatoensis
leucomycinA14-kitasomycinA14 | FnNCOFF | | S. kitasatoensis
leucomycinA3-josamycin-platenomycinA3-turimycinA5 | ENNAODF | | S. kitasatoensis, S. hygroscopicus, S. narbonensis, S. platensis
leucomycinA5-turimycinH4 | ENNCOFF | | S. kitasatoensis
lienomycin | SSSNSSSTSMLNRRNNNNNL | |


WO 01/92991 PCT/US01/17352 




iUwcxjMJuiywiir* bU uov\/iiiy win lucimycin-FJ1163-butylpimaricin | N i^ | | Apt on ^ 1nr»pnci c r\.vi« op , O. luwGUMo, L>. gulUvUu
L155175 | RSNMURmM | RS(2-methoxy)NMURmM |
L681110 | NMURMN | NMURMN(2-methoxy) |
macrocin-lactenocin | ■QTmXTIJnTYI? JoJYlNJiULJr | | S. aureus, S. lutea, K. pneum, B. subtly Shva..
macrocin-Y07625 | OMNGODF | | S. fradiae gs 16
maridomycinL. | ENNCOFF | | S. paltensis, S. rimosius, S. capuensis, S. racemochromogenes
maridomycin-platenomycinC3-turimycinEP5-B5050A-YL704- r L-j | ENNCJDF | | S. hygroscopicus-S. platensis-malvinus
mathemycinA | MRMRNLRL | |
midecamycinA1-platenomycinB1-SF837 | ENNCOFF | ENNCOF(2-methoxy)F | S. mycarofaciens
midecamycinA2-mydecamycinA2-SF837A2 | FNNDJEE | FNNDJE(2-hydroxy)E(1-C(=O)CC) | S. mycarofaciens
milbemycin | QQQMKNQQ | |
#monazomycin | SMRMRNLRSL T L» | |
mycinamicin | UJNINoJJL^JN | x>( i -eye jiNrsio j JJ1N i-6-cyc j |
mycinamicinVI | ■QXTNin TTWT Jt5JNINvjJJL>IS | | Micromonospora griseorubida sp.
Illy JcilJiiW LILA. 1 1 | DIN 1>J VJ-LdLxIN | | o. aureus, o. pyogenes, wUlVUCUaU LCI lUIil
#mycolactone | SRMUSMSSL^S 1YXL 1 IVJlTH N | SRMUSMSSL.sSMNMMN |
mycolactoneA | SRMUSMSSL | SRMUSMSSL |
11 1 j M%3 Low LUl lClJ | SOItH^iIYUYUN | oOlViJNIYJIYUN |
myxovirescinA1 | QQEFLNNLLIL JVJ | |
myxovirescinA2 | OOEFLNNLULJJ | |
myxovirescinB-megovalicinB | QQEFLNNLLILKM | |
myxovirescinC-C1 | QQEFLNNLLLLKJ | |
myxovirescinD | QQEFLNNLLLLKM | |
myxovirescinE | OQEFLNNLUU | |


myxovirescinF1 | QQEFLNNLLLLKJ | |
myxovirescinF2 | QQEFLNNLLLLJJ | |
myxovirescinG1 | QQEFLNNLLLKJ | |
myxovirescinG2 | QQEFLNNLLLJI | |
myxovirescinH1 | QQEFLNNLLILKJ | |
myxovirescinH2 | QQEFLNNLULJJ | |
myxovirescinL | QQEFLNNLLILHJ | |
myxovirescinP1 | QQEFLNNLLILLKJ | |
myxovirescinP2 | QQEFLNNLLILUJ | |
myxovirescinQ | QQEFLNNLLILKM | |
myxovirescinS | QQEFLNNLLILHJ | |
M4365G2 | BMNGODF | | Streptoverticillum kitasatoensis, S. thermotolerans
nancimycin | QNDACBNM | | S. albovinaceus
neocopiamycin | NRSSSSSRLRSRN | |
niddamycin-F3463-3desacetylcarbomycinB | FNNGOFF | FNNGOF(1-glycosyl,2-methoxy)F | S. aureus, S. lutea, B. subt
oligomycinA | QJNNJRGAGAN | | S. diastatochromogenes, S. chibaensis
oligomycinB | QJNNKCHDHDN | | S. diastatochromogenes
oligomycinB-44homo | QJNNJRGRGAN | | S. bottropensis
oligomycinD | QJNNJBGAGAN | | S. arabicus, S. parvulus, S. rutgersensis, S. griseus, S. aureofaciens
ossamycin | QQQNLLKFDDN | |
perimycin | JBNNNnNNNFCEFEBEEEB | JBNNNnNNNFCEFEEEEEE |
phenalamid | nCNNNNM | |
phenalamideA1-fenalamid-M02-C | JMCMNnNM | |
phenalamideA2-102-T | JMCMNNNM | |
phenalamideA3 | JMCMNNnM | |


phenalamideB | JMCMNhNM | |
phenalamideC | JMCMNNNM | |
phthoramycin | OONLUSRKN | |
pikromycin | ANGJDH | |
prasinan-L155175 | DNMJCMM | | S. hygroscopicus ma 5285; S. prasinus
protostreptovaricin | QDACDNM | |
PD118576A2 | ENMJCMN | | S. sp. wp 3913
PF1163 | EJLLQ | |
#quinolidomicin | SSLULSSNNNSISNNSSSUSRRQQKRSS | |
rapamycin | FGMEGJNNMEEKQQ | |
rhizopodin | RLQSNNSS | |
rifamycin | OOQNDACBNM | |
rimocidin | BNNNNFCEFELIA | |
rosamicin-repromicin | BMNGODF | |
rosaramycin | BMNGODF | | Micromonospora rosaria
rustmicin | QOQJMOG | QQQJMO(includes COH)G |
rutamycin | QQJNNJBGAGAN | |
scytophycin | BFCEENQEMN | |
scytophycinB-E | BFCEONQEMN | |
shurimycin | NNRSSSSSSRLRSRNN | |
sorangicin | LKMFnENLNCFDFNQNNnn | LKMF(2-hydroxyl)nENLNCFDFNONNnn |
sorangicinA | QNLNF | |
sorangicinB | NLNF | |
sorangolideA | LLLLKBUMEAFM | | myxobacterium Sorangium cellulosum
soraphen | EUFJDFA | |
spiramycin | nNCJDF | |
staphcoccomycin-angolamycin-shincomycin | CMNGODF | |
stipiamide | JMCNNnNM | |
tartrolonB2 | nNLFHE | nNLFHE | fragment
tedanolide | JGEHMDHF | JGEHMDHF(2-hydroxyl) |
thiazinotrienomycin | QLmRSNNNS | |
tiacumicin | SMMRMSNM | |
tylosinC-macrocin | EMNHODF | |
tylosin-A | EMNGODF | |
TAN-1323 | NMURRMM | |
TMC-34 | NRSSSSSSRLRSRN | |


venturicidinA | CKNFLMQQQ | |
vicenistatin | LNMULRNN | |
virustomycin-TAN1323C | QNMKCCMN | | S. sp. ch41
zincophorin-griseochelin | KMCNLACC | | S. griseus



In another embodiment, the polyketide library includes the name of the polyketide, the
CHUCKLES string and a linearized representation of the structure. The linearized
representations of the CHUCKLES structures for erythromycin and epothilone are as follows:




An illustrative example of a polyketide library containing linearized representations of their 
structures is found in Appendix C (deposited in the microfiche appendix). 
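
A library file of the kind shown in this example could be read as sketched below. The on-disk
layout of the library file is not specified in this text, so the Python sketch assumes one
polyketide per line with delimiter-separated columns (a tab by default) and treats a leading
"#" as marking a commented-out entry, as with the epothilone entry above. The function name
`read_library` is hypothetical.

```python
from typing import Dict, Iterable


def read_library(lines: Iterable[str], delimiter: str = "\t") -> Dict[str, str]:
    """Map polyketide name to CHUCKLES string, skipping commented-out entries."""
    library: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or an entry commented out of the library
        fields = [f.strip() for f in line.split(delimiter)]
        if len(fields) < 2 or not fields[1]:
            continue  # no CHUCKLES string present on this line
        library[fields[0]] = fields[1]
    return library
```

Commenting out the target in this way keeps the program from simply finding the target itself
in the library.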

EXAMPLE 3 

Alternative PKS Genes for Epothilone 

This example illustrates the alignment and design of novel PKS genes for the target 
epothilone. Epothilone is first converted into CHUCKLES string format and then read into the 
MORPH program as a TARGET. The program then generates all possible alignments of library 
modules and sorts the alignments to determine preferred combinations of modules for gene 
construction and production of epothilone via a novel polyketide synthase gene.

[Figure: structure of epothilone D, annotated with monomer letters]

The epothilone D structure above was first opened at the macrolactone ring closure
between the C-1 ketone and the C-15 oxygen. The monomer set shown in Figure 2 was then
matched against each of the successive pairs of macrocyclic backbone carbon atoms, starting
with C-2 and C-3, which match monomer E. The next two carbon atoms, C-4 and C-5, match
monomer G with an additional post-synthetic methylation on C-4. C-6 and C-7 match monomer
D. C-8 and C-9 match monomer J. C-10 and C-11 match monomer L. C-12 and C-13 match
monomer M. C-14 and C-15, where C-15 has a hydroxyl substitution (modified by thioesterase
to close the macrocycle), match monomer E. C-16 and C-17 match monomer M.

The rest of the molecule, a methyl-substituted thiazole moiety, does not match any of the
monomers in the monomer set. This moiety corresponds to a malonyl CoA loading module and
an NRPS module that together generate the methyl-substituted thiazole moiety. This moiety is
thus omitted from the CHUCKLES string generated from this illustrative monomer set but can
be added simply by adding a monomer to the set. The CHUCKLES string generated is
EGDJLMEM, which is in the reverse order of biosynthesis. This sequence is then reversed to
MEMLJDGE to yield a monomer sequence that matches the order of biosynthesis. The
sequence is then annotated to account for the post-synthetic modifications as follows:
MEMLJDG(2-methyl)E.
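
The two string operations used above, reversing a CHUCKLES string into the order of
biosynthesis and recovering the plain string from an annotated one, can be sketched in Python
as follows. The function names are hypothetical, and the sketch assumes that annotations are
always enclosed in (possibly nested) parentheses.

```python
def to_biosynthetic_order(chuckles: str) -> str:
    """Reverse a CHUCKLES string read from the ring-opened structure."""
    return chuckles[::-1]


def strip_annotations(annotated: str) -> str:
    """Drop parenthesised annotations, including nested ones, keeping the monomers."""
    depth = 0
    monomers = []
    for ch in annotated:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0:
            monomers.append(ch)  # character lies outside every annotation
    return "".join(monomers)


target = to_biosynthetic_order("EGDJLMEM")
plain = strip_annotations("MEMLJDG(2-methyl)E")
```

Both `target` and `plain` evaluate to MEMLJDGE, the target sequence given to the program in this
example.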

This target sequence is provided to the MORPH program to generate all possible 
combinations of modules in the CHUCKLES-encoded library that will yield the target 
CHUCKLES. The valid combinations are then sorted in increasing order of non-native inter- 
module interfaces. In one implementation, a MORPH run generated 3,452 valid sequences of 
five inter-module interfaces. Of these, none contain fewer than five inter-module interfaces. 
Some illustrative sample module combinations appear below. The combinations are shown 
listing each monomer followed by a colon and the name of the polyketide(s) from which it is
derived, followed by a parenthetical showing the associated monomers in that polyketide.
Vertical lines represent modular junctions between two different polyketides.

Illustrative PKS Gene 1: 

M:3-acetyl-4"-butyryltylosin(FMN) | E:tedanolide(GEH) | M:aldgamycin(BML)
L:aldgamycin(MLG) | J:aldgamycin(GJD) D:aldgamycin(JDL) | G:tedanolide(JGE)
E:tedanolide(GEH)

Illustrative PKS Gene 1 thus comprises one or more open reading frames that encode, in
the order listed, the module from the 3-acetyl-4"-butyryltylosin PKS that corresponds to monomer
M, the module from the tedanolide PKS corresponding to monomer E, the modules from the
aldgamycin PKS corresponding to monomers M, L, J, and D, and the modules from the
tedanolide PKS corresponding to monomers G and E.
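
The combination notation used for these illustrative genes can be read mechanically. The
following Python sketch splits a design at the vertical lines and extracts each
monomer:polyketide(context) item; the function name `parse_design` and the regular expression
are assumptions based only on the notation as printed in this example.

```python
import re
from typing import List, Tuple

Item = Tuple[str, str, str]  # (monomer, source polyketide, neighbouring monomers)


def parse_design(text: str) -> List[List[Item]]:
    """Split a design into pieces at '|' and parse the items within each piece."""
    pieces = []
    for chunk in text.split("|"):
        items = re.findall(r"(\w):([^\s()]+)\((\w+)\)", chunk)
        if items:
            pieces.append(items)
    return pieces


GENE_1 = (
    'M:3-acetyl-4"-butyryltylosin(FMN) | E:tedanolide(GEH) | M:aldgamycin(BML) '
    "L:aldgamycin(MLG) | J:aldgamycin(GJD) D:aldgamycin(JDL) | G:tedanolide(JGE) "
    "E:tedanolide(GEH)"
)
pieces = parse_design(GENE_1)
sequence = "".join(monomer for piece in pieces for monomer, _, _ in piece)
junctions = len(pieces) - 1
```

For Illustrative PKS Gene 1, the sketch yields five pieces separated by four vertical lines, and
the monomers read in order give MEMLJDGE.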

Illustrative PKS Gene 2: 

M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME)
E:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(MEJ) |
M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME) |
L:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(BLM) |
J:erythromycin(GJD) D:erythromycin(JDD) | G:tedanolide(JGE) E:tedanolide(GEH)

EXAMPLE 4 

Alternative PKS Genes for 6-Deoxyerythronolide B 

This example illustrates the alignment and design of novel PKS genes for the 
erythromycin basic polyketide structure (6-dEB) using the MORPH program. 

For the 6-dEB structure above, the CHUCKLES string is generated by first opening the 
macrolactone ring closure between the C-1 ketone and the C-13 oxygen. Using the monomer 
set and matching protocol described in Example 3, one generates the CHUCKLES string 
DDJGDA, in the reverse order of biosynthesis. This sequence is then reversed to ADGJDD to 
yield the monomer sequence that matches the order of biosynthesis. The sequence is then 
annotated to account for the post-synthetic modifications (erythromycin A) as follows: 
A(2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl). 

This target sequence is supplied to the MORPH program to generate all possible 
combinations of modules in the CHUCKLES-encoded library. The valid combinations are then 
sorted in increasing order of non-native inter-module interfaces. In one implementation, a 
MORPH run generated 19,631 valid sequences of less than or equal to five inter-module 
interfaces. Of these, 13,306 contain 4 inter-module interfaces, and 256 contain only 3 
inter-module interfaces. Some of these contain only two inter-module interfaces, and one 
contains only one. Some illustrative sample module combinations follow. 
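
The sorting step described above can be sketched in C as follows. This is an illustration only, not the MORPH implementation: the function names count_junctions and sort_by_junctions are hypothetical, and the sketch assumes that each combination is held as a text string in the notation used for the illustrative genes, where every vertical line marks one junction between modules from different polyketides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the vertical lines in one combination
   written in the illustrative-gene notation. */
int count_junctions(const char *combo)
{
    int n = 0;

    assert(combo != NULL);
    for (; *combo != '\0'; combo++) {
        if (*combo == '|') n++;
    }
    return n;
}

/* qsort comparator: fewer junctions sort first. */
static int by_junctions(const void *a, const void *b)
{
    const char *sa = *(const char * const *)a;
    const char *sb = *(const char * const *)b;

    return count_junctions(sa) - count_junctions(sb);
}

/* Hypothetical helper: sort an array of combination strings in
   increasing order of junction count. */
void sort_by_junctions(const char **combos, size_t n)
{
    qsort(combos, n, sizeof *combos, by_junctions);
}
```

With this ordering, combinations assembled from the fewest separate pieces appear first. The standard qsort function is not guaranteed to be stable, so combinations with equal counts may appear in any order.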

Illustrative PKS Gene 1: 

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:mycinamicin(NGJ) J:mycinamicin(GJD) 
D:mycinamicin(JDN) | D:amphotericinA(CDN) 

Illustrative PKS Gene 1 thus comprises one or more open reading frames that encode, in 
the order listed, the amphotericin PKS module corresponding to monomer A, the aldgamycin 
PKS module corresponding to monomer D, the mycinamicin PKS modules corresponding to 
monomers G, J, and D, and the amphotericin PKS module corresponding to monomer D. 

Illustrative PKS Gene 2: 

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:pikromycin(NGJ) J:pikromycin(GJD) 
D:pikromycin(JDH) | D:aldgamycin(JDL) 

Illustrative PKS Gene 3: 

A:lankamycin-kujimycin-landavamycin-A20338N2(-AD) D:lankamycin-kujimycin-
landavamycin-A20338N2(ADG) G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ) 
J:lankamycin-kujimycin-landavamycin-A20338N2(GJD) | D:ossamycin(FDD) 
D:ossamycin(DDN) 

Illustrative PKS Gene 4: 

A:amphotericinA(EAL) | D:lankamycin-kujimycin-landavamycin-A20338N2(ADG) 
G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ) J:lankamycin-kujimycin-
landavamycin-A20338N2(GJD) | D:A82548A-cytovaricin(EDD) D:A82548A-
cytovaricin(DDN) 

Illustrative PKS Gene 5: 

A:lankamycin-kujimycin-landavamycin-A20338N2(-AD) D:lankamycin-kujimycin-
landavamycin-A20338N2(ADG) G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ) 
J:lankamycin-kujimycin-landavamycin-A20338N2(GJD) D:lankamycin-kujimycin-
landavamycin-A20338N2(JDD) D:lanka^ 

EXAMPLE 5 

Source Code: 
#include <stdio.h> 

/* ~ani/programs/morph/morph3.c 

PURPOSE: To traverse recursively all the entries in PKS.lib, 
generating all feasible combinations of PKS modules to make the TARGET (e.g., epothilone). 

INPUT: 

libraryfile: tab-delimited CHUCKLES-coded polyketides file with the 
following columns: 

1. polyketide name 
2. plain CHUCKLES 
3. annotated CHUCKLES (contains information about post-synthetic modifications) 
4. source organism; 

targetname: user-defined name (e.g., epoD); 

targetsequence: CHUCKLES-coded polyketide of desired TARGET (e.g., MEMLJDGE); 

X, Y, Z sets of wildcards: sets of monomers for particular positions 
appearing in target sequence (the wildcards can be used for analoging the TARGET polyketide); 

hard-coded parameters which may be reset (requires recompiling): 

NBOUNDARY_CUTOFF determines the maximum number of 
non-native inter-modular interfaces which are contained in the output (set to 5, but may be 
increased or decreased); and 

RECURSION_COUNTER_CUTOFF specifies the number of 
levels of recursion (defaults to 0, 1, 2) acceptable for the run; a large PKS library can cause 
recursion that will greatly increase run time. Because of the multi-directionality of the 
alignments (using every library entry as a STARTER), there is typically no need to go beyond 2 
levels of recursion. 

OUTPUT: 

All combinations of modules that meet parameters set by user. Example 
output from MEMLJDGE (epothilone D) using a subset of a PKS library is provided below. 
Vertical bars indicate non-native inter-modular interfaces. Last column contains the number of 
"pieces" that are needed to put together the PKS. 

Names of PKSs have been abbreviated to fit them in these comments. 

HIT M:3atyl(FMN)| E:tedan(GEH)| M:aldga(BML) L:aldga(MLG)| 
J:aldga(GJD) D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5 
HIT M:albMl(LME) E:albMl(MEJ)| M:albMl(LME)| L:aldga(MLG)| 
J:aldga(GJD) D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5 
HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| 
J:aldga(GJD) D:aldga(JDL)| G:3atyl(NGO)| E:albMl(MEJ)| 5 
HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| 
J:aldga(GJD) D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5 
HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| 
J:aldga(GJD) D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5 

USAGE: 

morph3 -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] 
[-z Z-wildcards] 

examples: 

# generate combinations that yield epothilone D 
%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD 
%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort 
%egrep ALIGN_TARGET omorph3_epoD > omorph3_epoD_STARTER_ALIGN 

# generate combinations that yield epothilone D with a C13-hydroxyl 
%morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGE -x ABCD > oepoD-13OH 
%egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort 
%egrep ALIGN_TARGET oepoD-13OH > oepoD-13OH_STARTER_ALIGN 

# generate combinations that yield epothilone with the following wildcards (set 1) 
%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCD -y LEFIN -z JACGM > oepoD-set1 
%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort 

# generate combinations that yield epothilone with the following wildcards (set 2) 
%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2 
%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort 

LIMITATIONS: 

This version does not handle intra-modular modifications/splitting because 
morph is operating at the monomer level. Modifications could convert the CHUCKLES-encoded 
strings into the corresponding and equivalent SMILES and then perform more complex 
chemical analysis of the PKS molecular graphs. 

Currently, inter-modular double bonds are present in the library, but are ignored by the 
morph program. 
*/ 

#include <stdio.h> 
#include <stdlib.h>   /* malloc, exit */ 
#include <string.h>   /* strcpy, strlen */ 

/* ~ani/programs/morph/morph3.c 

*/ 

#define TRUE 1 
#define FALSE 0 
#define DEBUG_MATCH FALSE 
#define DEBUG_STARTER FALSE 
#define DEBUG_ALIGN FALSE 
#define DEBUG_RECURSE FALSE 
#define DEBUG_WILDCARD FALSE 
#define MAXLEN 80 
#define MAX_TYL_LEN 6 
#define MAX_EPO_LEN 6 
#define MAXNAMELEN 160 
#define MAX_LIB_ENTRIES 500 
#define MAXWILD 3 
#define MAXBUF 200 
#define NBOUNDARY_CUTOFF 5 
#define RECURSION_COUNTER_CUTOFF 2 
#define STARTER_MINIMUM_ADJACENT_ALIGN 2 
#define MINIMUM_ADJACENT_ALIGN 2 



typedef struct _lib { 
    char name[MAXNAMELEN]; 
    char monomersequence[MAXNAMELEN]; 
    char annotatedsequence[MAXNAMELEN]; 
    char alignedsequence[MAXNAMELEN]; 
    char alignedPKSname[MAXLEN][MAXNAMELEN]; 
    int  boundarytoright[MAXNAMELEN]; 
    int  marked[MAXLEN]; 
    char context[MAXLEN][4]; 
    int  recursion_tagged; 
    int  nboundary; 
}LIB; 

int main(int argc, char **argv) 
{ 
    int ii=0,jj=0,kk=0,ll=0; 
    int nlib=0; 
    int ecount=0; 
    int nfilled=0,nfilledmax=0; 
    int epothilonelen=0; 
    int nlargestpiece=0,tnlargestpiece=0; 
    int mmpass=0; 
    int lcount=0; 
    int new_unmarked_entries_filled=0; 
    int recursion_counter = 0; 
    int nwildcard=0; 
    int best_new_unmarked_entries_filled = 0; 
    int smallest_acceptable_piece = 0; 
    int current_nmarked=0,previous_nmarked=0; 
    char *sptr, *eptr, *lptr,*bufptr; 
    char *clibptr; 
    char *libraryfile; 
    char *targetsequence,*targetname; 
    char buf[MAXBUF]; 
    char wildcards[MAXWILD][MAXLEN]; 
    FILE *libp; 
    LIB epotemp; 
    LIB library[MAX_LIB_ENTRIES]; 
    LIB epothilone; 
    char *progname; 
    char **filelist, **fileptr; 

    libraryfile = ""; 
    targetsequence = ""; 
    targetname = ""; 
    for(ii=0; ii<MAXWILD; ii++) { 
        for(jj=0;jj<MAXLEN;jj++) { 
            wildcards[ii][jj] = '\0'; 
        } 
    } 

    /* process arguments */ 
    filelist = fileptr = (char **)(malloc(argc * sizeof(*argv))); 
    progname = *argv++; 
    if(argc<2) { 
        fprintf(stderr,"usage:%s -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]\n",progname); 
        exit(0); 
    } 

    while(argc-- > 1) { 
        if(argv[0][0] == '-' && argv[0][1] != '\0') { 
            /* handle option */ 
            *++(*argv); /* advance past the minus */ 
            switch(**argv) { 
            case 'l': /* get library input filename (PKS.lib) */ 
                argv++; argc--; 
                libraryfile = argv[0]; 
                fprintf(stderr,"-l: libraryfile=%s\n",libraryfile); 
                break; 
            case 'n': /* get target name string */ 
                argv++; argc--; 
                targetname = argv[0]; 
                fprintf(stderr,"-t: targetname=%s\n",targetname); 
                break; 
            case 't': /* get target sequence string */ 
                argv++; argc--; 
                targetsequence = argv[0]; 
                fprintf(stderr,"-t: targetsequence=%s\n",targetsequence); 
                break; 
            case 'x': /* get a wildcard string */ 
                argv++; argc--; 
                strcpy(wildcards[0],argv[0]); 
                fprintf(stderr,"-x: wildcards[%d]=%s\n",0,wildcards[0]); 
                nwildcard++; 
                break; 
            case 'y': /* get a wildcard string */ 
                argv++; argc--; 
                strcpy(wildcards[1],argv[0]); 
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]); 
                nwildcard++; 
                break; 
            case 'z': /* get a wildcard string */ 
                argv++; argc--; 
                strcpy(wildcards[2],argv[0]); 
                fprintf(stderr,"-x: wildcards[%d]=%s\n",2,wildcards[2]); 
                nwildcard++; 
                break; 
            case 'X': /* get a wildcard string */ 
                argv++; argc--; 
                strcpy(wildcards[0],argv[0]); 
                fprintf(stderr,"-x: wildcards[%d]=%s\n",0,wildcards[0]); 
                nwildcard++; 
                break; 
            case 'Y': /* get a wildcard string */ 
                argv++; argc--; 
                strcpy(wildcards[1],argv[0]); 
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]); 
                nwildcard++; 
                break; 
            case 'Z': /* get a wildcard string */ 
                argv++; argc--; 
                strcpy(wildcards[2],argv[0]); 
                fprintf(stderr,"-x: wildcards[%d]=%s\n",2,wildcards[2]); 
                nwildcard++; 
                break; 
            default: 
                fprintf(stderr,"%s unknown option; ignored\n",*argv); 
            }/*switch*/ 
        } else { /* a regular filename */ 
            *fileptr++ = *argv; 
            *fileptr = NULL; 
        } 
        argv++; 
    }/*while*/ 
    if(nwildcard > 0) { 
        for(ii=0; ii< nwildcard; ii++) { 
            fprintf(stderr,"wildcards[%d]=%s\n",ii,wildcards[ii]); 
            fprintf(stdout,"wildcards[%d]=%s\n",ii,wildcards[ii]); 
        } 
    } 

    epothilone.nboundary = 0; 
    for(ii=0; ii<MAXNAMELEN; ii++) { 
        epothilone.name[ii] = '\0'; 
        epothilone.monomersequence[ii] = '\0'; 
        epothilone.alignedsequence[ii] = '\0'; 
        epothilone.boundarytoright[ii] = TRUE; 
        for(jj=0;jj<MAXLEN;jj++) { 
            epothilone.alignedPKSname[jj][ii] = '\0'; 
            epothilone.marked[jj] = FALSE; 
            epothilone.context[jj][0] = '\0'; 
            epothilone.context[jj][1] = '\0'; 
            epothilone.context[jj][2] = '\0'; 
        } 
    } 

    strcpy(epothilone.name,targetname); 
    strcpy(epothilone.monomersequence,targetsequence); 
    fprintf(stdout, "TARGET: %s\n", epothilone.monomersequence); 
    ecount = 0; 
    eptr = epothilone.monomersequence; 
    while(*eptr != '\0') { 
        if(ecount == 0) { 
            epothilone.context[ecount][0] = '-'; 
            epothilone.context[ecount][1] = *eptr; 
            epothilone.context[ecount][2] = *(eptr + 1); 
        } else { 
            if(ecount == (strlen(epothilone.monomersequence) - 1)) { 
                epothilone.context[ecount][0] = *(eptr-1); 
                epothilone.context[ecount][1] = *eptr; 
                /* value illegible in the source listing; '-' is assumed by symmetry with the first position */ 
                epothilone.context[ecount][2] = '-'; 
            } else { 
                epothilone.context[ecount][0] = *(eptr-1); 
                epothilone.context[ecount][1] = *eptr; 
                epothilone.context[ecount][2] = *(eptr+1); 
            } 
        } 
        epothilone.context[ecount][3] = '\0'; 
        eptr++; 
        ecount++; 
    } 

    for(ii=0; ii<ecount; ii++) { 
        fprintf(stdout,"(%s)\n",epothilone.context[ii]); 
    } 

    /* library */ 
    nlib = get_library(libraryfile,library); 
    fprintf(stdout,"nlib=%d\n",nlib); 

    kk=0; while(kk<nlib){ 
        /* zero out the epothilone entry with respect to a new alignment */ 
        for(ii=0; ii<MAXNAMELEN; ii++) { 
            epothilone.alignedsequence[ii] = '\0'; 
            epothilone.boundarytoright[ii] = TRUE; 
            for(jj=0;jj<MAXLEN;jj++) { 
                epothilone.alignedPKSname[jj][ii] = '\0'; 
                epothilone.marked[jj] = FALSE; 
            } 
        } 

        /* - reset the context back to that in epothilone - */ 
        ecount = 0; 
        eptr = epothilone.monomersequence; 
        while(*eptr != '\0') { 
            if(ecount == 0) { 
                epothilone.context[ecount][0] = '-'; 
                epothilone.context[ecount][1] = *eptr; 
                epothilone.context[ecount][2] = *(eptr+1); 
            } else { 
                if(ecount == (strlen(epothilone.monomersequence) - 1)) { 
                    epothilone.context[ecount][0] = *(eptr-1); 
                    epothilone.context[ecount][1] = *eptr; 
                    /* value illegible in the source listing; '-' is assumed */ 
                    epothilone.context[ecount][2] = '-'; 
                } else { 
                    epothilone.context[ecount][0] = *(eptr-1); 
                    epothilone.context[ecount][1] = *eptr; 
                    epothilone.context[ecount][2] = *(eptr+1); 
                } 
            } 
            epothilone.context[ecount][3] = '\0'; 
            eptr++; 
            ecount++; 
        } 

        /* - align STARTER (current library entry) and epothilone */ 
        sptr = library[kk].monomersequence; 
        lcount=0; 
        while(*sptr != '\0'){ 
            /* format string partly illegible in the source listing; reconstructed from the argument list */ 
            fprintf(stdout,"library[%d].monomersequence[%d]=%c\n", 
                kk,lcount,library[kk].monomersequence[lcount]); 
            sptr++; 
            lcount++; 
        } 

        /* Call maximal_adjacent_alignment until it no longer 
           returns more than two adjacent modules. There is no reason to 
           try to extract individual modules, because this is done as part 
           of the recursive filling of spaces from the library. 
        */ 
        smallest_acceptable_piece = 2; 
        eptr = epothilone.monomersequence; 
        fprintf(stdout, "ALIGN_TARGET: "); 
        while(*eptr != '\0'){ 
            fprintf(stdout,"%c",*eptr); 
            eptr++; 
        } 
        fprintf(stdout,"\n"); 
        fprintf(stderr, "aligning %d %s\n",kk, library[kk].name); 
        best_new_unmarked_entries_filled = 0; 
        /* argument list illegible in the source listing; reconstructed from the function definition */ 
        while((new_unmarked_entries_filled = 
            maximal_adjacent_alignment_and_dump(&epothilone,nwildcard,wildcards,library,kk, 
            smallest_acceptable_piece)) >= STARTER_MINIMUM_ADJACENT_ALIGN) { 
            if(best_new_unmarked_entries_filled < new_unmarked_entries_filled){ 
                best_new_unmarked_entries_filled = new_unmarked_entries_filled; 
            } 
            if(DEBUG_STARTER) fprintf(stdout, "STARTER ALIGN: new_unmarked_entries_filled=%d\n", 
                new_unmarked_entries_filled); 
            epothilonelen = strlen(epothilone.monomersequence); 
            for(ii=0; ii< epothilonelen; ii++){ 
                if(DEBUG_STARTER) fprintf(stdout, 
                    "STARTER ALIGN: found a best alignment between epo.monomer[%d]=%c in library[%d].name=%s\n", 
                    ii,epothilone.monomersequence[ii],kk,epothilone.alignedPKSname[ii]); 
            } 
        } 

        library[kk].recursion_tagged = TRUE; 
        fprintf(stdout,"ALIGN_TARGET: \n"); 
        dump_STARTER_align(epothilone,nwildcard,wildcards); 
        fprintf(stdout,"ALIGN_TARGET: \n"); 
        if(best_new_unmarked_entries_filled <= 1) { 
            fprintf(stdout,"ALIGN_TARGET: PROBLEM best_new_unmarked_entries_filled = %d\n", 
                best_new_unmarked_entries_filled); 
            fprintf(stdout,"ALIGN_TARGET: PROBLEM skipping this STARTER entry for library[%d].name=%s\n", 
                kk,library[kk].name); 
            library[kk].recursion_tagged = FALSE; 
            kk++; 
            continue; 
        } 

        /* - fill in the gaps from the library - */ 

        /* generate a fresh copy of epothilone in epotemp */ 
        epothilonelen = strlen(epothilone.monomersequence); 
        nfilledmax = strlen(epothilone.monomersequence); 
        fprintf(stdout,"nfilledmax=%d\n", nfilledmax); 
        reset_epotemp(&epotemp,epothilone); 
        nfilled = 0; 
        for(ii=0; ii< epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; } 
        if(DEBUG_STARTER) fprintf(stdout,"nfilled from STARTER=%d\n",nfilled); 
        for(mmpass = 0; mmpass < nlib; mmpass++) { 
            if(mmpass == kk) { continue; } 
            reset_epotemp(&epotemp,epothilone); 
            nfilled = 0; 
            for(ii=0; ii< epothilonelen; ii++) { 
                if(epotemp.marked[ii] == TRUE) nfilled++; } 
            if(nfilled >= nfilledmax) { 
                output_fresh_alignment(&epotemp); 
            } else { 
                current_nmarked = nfilled; 
                previous_nmarked = nfilled; 
                smallest_acceptable_piece = 1; 
                while((new_unmarked_entries_filled = 
                    maximal_adjacent_alignment(&epotemp,nwildcard,wildcards,library,mmpass, 
                    smallest_acceptable_piece)) >= MINIMUM_ADJACENT_ALIGN) { 
                    current_nmarked += new_unmarked_entries_filled; 
                    if(DEBUG_MATCH) fprintf(stdout, 
                        "main: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n", 
                        recursion_counter,mmpass,previous_nmarked,current_nmarked); 
                } 
                nfilled = 0; 
                for(ii=0; ii< epothilonelen; ii++) { 
                    if(epotemp.marked[ii] == TRUE) nfilled++; } 
                if(nfilled >= nfilledmax) { 
                    output_fresh_alignment(&epotemp); 
                    continue; /* no need to recurse */ 
                } 
                if(DEBUG_MATCH) fprintf(stdout, "main: about to RECURSE: mmpass=%d\n",mmpass); 
                library[mmpass].recursion_tagged = TRUE; 
                recursion_counter++; 
                /* argument list partly illegible in the source listing; reconstructed from the function definition */ 
                recurse_through_the_library(nfilledmax,epotemp,&epothilone,nwildcard,wildcards, 
                    nlib,library,&recursion_counter); 
                library[mmpass].recursion_tagged = FALSE; 
                recursion_counter--; 
            } 
        } 

        library[kk].recursion_tagged = FALSE; 
        kk++; 
    }/*nlib*/ 
    return 0; 
}/*main*/ 

int recurse_through_the_library( 
    int nfilledmax, 
    LIB epotemp, 
    LIB *epothilone, 
    int nwildcard, 
    char wildcards[MAXWILD][MAXLEN], 
    int nlib, 
    LIB *library, 
    int *recursion_counter) 
{ 
    int ii=0; 
    int ecount=0,elen=0; 
    int mmpass=0; 
    int nfilled=0; 
    int lcount=0; 
    int previous_nmarked=0, current_nmarked=0; 
    int smallest_acceptable_piece = 0; 
    char *eptr; 
    char *clibptr; 
    char boundary[MAXNAMELEN]; 
    int new_unmarked_entries_filled=0; 
    LIB epotemp_temp; 

    if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, nlib=%d\n", 
        *recursion_counter,nlib); 
    elen = strlen(epotemp.monomersequence); 
    nfilled = 0; 
    for(ii=0; ii< elen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; } 
    previous_nmarked = nfilled; 
    current_nmarked = nfilled; 
    smallest_acceptable_piece = 1; 
    /* comparison operator illegible in the source listing; >= is assumed, as in main */ 
    if(nfilled >= nfilledmax) { return 1; } 

    for (mmpass = 0; mmpass < nlib; mmpass++) { 
        if(*recursion_counter >= RECURSION_COUNTER_CUTOFF) { 
            return 1; 
        } 
        if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, mmpass=%d\n", 
            *recursion_counter,mmpass); 
        if(library[mmpass].recursion_tagged == TRUE) { 
            if(DEBUG_MATCH) fprintf(stdout,"RECURSE: library[%d].recursion_tagged=TRUE; skipping\n", 
                mmpass); 
            continue; 
        } 

        reset_epotemp(&epotemp_temp,epotemp); 
        elen = strlen(epotemp_temp.monomersequence); 
        nfilled = 0; 
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; } 
        previous_nmarked = nfilled; 
        current_nmarked = nfilled; 
        while((new_unmarked_entries_filled = 
            maximal_adjacent_alignment(&epotemp_temp,nwildcard,wildcards,library, 
            mmpass,smallest_acceptable_piece)) >= 1) { 
            current_nmarked += new_unmarked_entries_filled; 
            if(DEBUG_MATCH) fprintf(stdout, 
                "RECURSE: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n", 
                *recursion_counter,mmpass,previous_nmarked,current_nmarked); 
        } 

        elen = strlen(epotemp_temp.monomersequence); 
        nfilled = 0; 
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; } 
        if(nfilled >= nfilledmax) { 
            output_fresh_alignment(&epotemp_temp); 
            continue; 
        } 

        library[mmpass].recursion_tagged = TRUE; 
        (*recursion_counter)++; 
        recurse_through_the_library(nfilledmax,epotemp_temp,epothilone,nwildcard, 
            wildcards,nlib,library,recursion_counter); 
        library[mmpass].recursion_tagged = FALSE; 
        (*recursion_counter)--; 
    }/* mmpass */ 
    return 1; 
} /*recurse_through_the_library*/ 
/* 
PURPOSE: 

INPUT: 

OUTPUT: 
    returns the size of the largest maximal adjacent set of 
    monomers inserted. 

PROCEDURE: 
*/ 

int maximal_adjacent_alignment( 
    LIB *epothilone, 
    int nwildcard, 
    char wildcards[MAXWILD][MAXLEN], 
    LIB *library, 
    int ilib, 
    int smallest_acceptable_piece) 
{ 
    int ii=0,jj=0,kk=0; 
    int ecount=0,lcount=0; 
    int epothilonelen=0; 
    int nlargestpiece=0,tnlargestpiece=0; 
    int hold_this_lcount=0, hold_this_ecount=0; 
    int wildcardmatch=FALSE; 
    char *wptr; 
    char *largestpiece_sptr,*largestpiece_eptr; 
    char *hold_this_place_eptr, *hold_this_place_sptr; 
    int largestpiece_lcount=0,largestpiece_ecount=0; 
    char *sptr, *eptr, *lptr,*bufptr; 

    /* the two debug format strings below are partly illegible in the source listing and are reconstructed */ 
    if(DEBUG_WILDCARD) { 
        if(nwildcard > 0) { 
            fprintf(stdout,"maximal_adjacent_alignment: wildcards[0]=%s\n",wildcards[0]); 
        } 
    } 

    if(DEBUG_ALIGN) fprintf(stdout,"maximal_adjacent_alignment smallest_acceptable_piece=%d\n", 
        smallest_acceptable_piece); 

    sptr = library[ilib].monomersequence; 
    eptr = epothilone->monomersequence; 
    ecount=0; 
    lcount=0; 
    nlargestpiece=0; 
    tnlargestpiece=0; 
    hold_this_place_eptr = eptr; 
    hold_this_ecount = ecount; 

    while (*eptr != '\0') { 
        sptr = library[ilib].monomersequence; 
        lcount = 0; 
        hold_this_place_sptr = sptr; 
        hold_this_lcount = lcount; 
        wildcardmatch = FALSE; 
        while(*sptr != '\0'){ 
            wildcardmatch = FALSE; 
            if(epothilone->marked[ecount] == FALSE) { 
                /* code for wildcards added MAS 05-16-00 */ 
                wptr = ""; 
                if(*eptr == 'X') { wptr = wildcards[0]; } 
                else if(*eptr == 'Y') { wptr = wildcards[1]; } 
                else if(*eptr == 'Z') { wptr = wildcards[2]; } 
                while(*wptr != '\0'){ 
                    if(*wptr == *sptr){ 
                        wildcardmatch = TRUE; 
                        break; 
                    } 
                    wptr++; 
                } 
                if((wildcardmatch == TRUE) || (*eptr == *sptr)) { 
                    tnlargestpiece++; 
                    if(DEBUG_ALIGN) fprintf(stdout, 
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n", 
                        tnlargestpiece, 
                        ecount, *eptr, ilib,library[ilib].name,lcount, *sptr); 
                    if(tnlargestpiece > nlargestpiece) { 
                        nlargestpiece = tnlargestpiece; 
                        largestpiece_sptr = hold_this_place_sptr; 
                        largestpiece_lcount = hold_this_lcount; 
                        largestpiece_eptr = hold_this_place_eptr; 
                        largestpiece_ecount = hold_this_ecount; 
                        if(DEBUG_ALIGN) 
                            fprintf(stdout, 
                                "FOUND a largest piece: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n", 
                                nlargestpiece, 
                                largestpiece_ecount, *largestpiece_eptr, 
                                ilib,library[ilib].name,largestpiece_lcount, *largestpiece_sptr); 
                    } 
                    sptr++; 
                    lcount++; 
                    eptr++; 
                    ecount++; 
                } else { 
                    tnlargestpiece = 0; 
                    sptr++; 
                    lcount++; 
                    /*NEW*/ 
                    hold_this_place_sptr = sptr; 
                    hold_this_lcount = lcount; 
                    eptr = hold_this_place_eptr; 
                    ecount = hold_this_ecount; 
                } 
            } else { 
                tnlargestpiece = 0; 
                break; 
            } 
        } 

        tnlargestpiece = 0; 
        eptr = hold_this_place_eptr + 1; 
        ecount = hold_this_ecount + 1; 
        hold_this_place_eptr = eptr; 
        hold_this_ecount = ecount; 
    } 

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n", 
        nlargestpiece,library[ilib].name); 
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n", 
        largestpiece_ecount,largestpiece_lcount); 
    if(nlargestpiece >= smallest_acceptable_piece) { 
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n"); 
        lcount = largestpiece_lcount; 
        ecount = largestpiece_ecount; 
        while(ecount < (nlargestpiece + largestpiece_ecount)) { 
            epothilone->alignedsequence[ecount] = 
                library[ilib].monomersequence[lcount]; 
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name); 
            /* second argument illegible in the source listing; the library context is assumed */ 
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]); 
            epothilone->marked[ecount] = TRUE; 
            if(ecount < (nlargestpiece + largestpiece_ecount - 1)) 
                epothilone->boundarytoright[ecount] = FALSE; 
            lcount++; 
            ecount++; 
        } 
    } 

    return (nlargestpiece); 
}/*maximal_adjacent_alignment*/ 
/* 
PURPOSE: 

INPUT: 

OUTPUT: 
    returns the size of the largest maximal adjacent set of 
    monomers inserted. 

PROCEDURE: 
*/ 

int maximal_adjacent_alignment_and_dump( 
    LIB *epothilone, 
    int nwildcard, 
    char wildcards[MAXWILD][MAXLEN], 
    LIB *library, 
    int ilib, 
    int smallest_acceptable_piece) 
{ 
    int ii=0,jj=0,kk=0; 
    int ecount=0,lcount=0; 
    int elen=0; 
    int epothilonelen=0; 
    int nlargestpiece=0,tnlargestpiece=0; 
    int hold_this_lcount=0, hold_this_ecount=0; 
    int wildcardmatch=FALSE; 
    char *wptr; 
    char *largestpiece_sptr,*largestpiece_eptr; 
    char *hold_this_place_eptr, *hold_this_place_sptr; 
    int largestpiece_lcount=0,largestpiece_ecount=0; 
    char *sptr, *eptr, *lptr,*bufptr; 

    /* the two format strings below are partly illegible in the source listing and are reconstructed */ 
    if(DEBUG_WILDCARD) { 
        if(nwildcard>0) { 
            fprintf(stdout,"maximal_adjacent_alignment_and_dump: wildcards[0]=%s\n",wildcards[0]); 
        } 
    } 

    fprintf(stdout,"maximal_adjacent_alignment_and_dump smallest_acceptable_piece=%d\n", 
        smallest_acceptable_piece); 

    sptr = library[ilib].monomersequence; 
    eptr = epothilone->monomersequence; 
    elen = strlen(epothilone->monomersequence); 
    ecount=0; 
    lcount=0; 
    nlargestpiece=0; 
    tnlargestpiece=0; 
    hold_this_place_eptr = eptr; 
    hold_this_ecount = ecount; 

    while (*eptr != '\0') { 
        sptr = library[ilib].monomersequence; 
        lcount = 0; 
        /* NEW */ 
        hold_this_place_sptr = sptr; 
        hold_this_lcount = lcount; 
        wildcardmatch = FALSE; 
        while(*sptr != '\0'){ 
            wildcardmatch = FALSE; 
            if(epothilone->marked[ecount] == FALSE) { 
                /* code for wildcards added MAS 05-16-00 */ 
                wptr = ""; 
                if(*eptr == 'X') { wptr = wildcards[0]; } 
                else if(*eptr == 'Y') { wptr = wildcards[1]; } 
                else if(*eptr == 'Z') { wptr = wildcards[2]; } 
                while(*wptr != '\0'){ 
                    if(*wptr == *sptr){ 
                        wildcardmatch = TRUE; 
                        break; 
                    } 
                    wptr++; 
                } 
                if((wildcardmatch == TRUE) || (*eptr == *sptr)) { 
                    tnlargestpiece++; 
                    if(DEBUG_ALIGN) fprintf(stdout, 
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n", 
                        tnlargestpiece, 
                        ecount, *eptr, ilib,library[ilib].name,lcount, *sptr); 
                    if(tnlargestpiece > nlargestpiece) { 
                        nlargestpiece = tnlargestpiece; 
                        largestpiece_sptr = hold_this_place_sptr; 
                        largestpiece_lcount = hold_this_lcount; 
                        largestpiece_eptr = hold_this_place_eptr; 
                        largestpiece_ecount = hold_this_ecount; 
                        if(DEBUG_ALIGN) 
                            fprintf(stdout, "FOUND a largest piece: len=%d, epo(%d, %c), lib(%d, %c)\n", 
                                nlargestpiece, 
                                largestpiece_ecount, *largestpiece_eptr, largestpiece_lcount, 
                                *largestpiece_sptr); 
                    } 
                    sptr++; 
                    lcount++; 
                    eptr++; 
                    ecount++; 
                } else { 
                    tnlargestpiece = 0; 
                    sptr++; 
                    lcount++; 
                    hold_this_place_sptr = sptr; 
                    hold_this_lcount = lcount; 
                    /*NEW*/ 
                    eptr = hold_this_place_eptr; 
                    ecount = hold_this_ecount; 
                } 
            } else { 
                tnlargestpiece = 0; 
                break; 
            } 
        } 

        tnlargestpiece = 0; 
        eptr = hold_this_place_eptr + 1; 
        ecount = hold_this_ecount + 1; 
        hold_this_place_eptr = eptr; 
        hold_this_ecount = ecount; 
        if(DEBUG_ALIGN) { 
            fprintf(stdout,"incrementing hold_this_place_eptr=%c, hold_this_ecount=%d\n", 
                *hold_this_place_eptr,hold_this_ecount); 
        } 
    } 

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n", 
        nlargestpiece,library[ilib].name); 
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n", 
        largestpiece_ecount,largestpiece_lcount); 
    if(nlargestpiece >= smallest_acceptable_piece) { 
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n"); 
        lcount = largestpiece_lcount; 
        ecount = largestpiece_ecount; 
        fprintf(stdout,"ALIGN_TARGET: "); 
        for(ii=0; ii<largestpiece_ecount; ii++) { 
            fprintf(stdout, " "); 
        } 

        while(ecount < (nlargestpiece + largestpiece_ecount)) { 
            epothilone->alignedsequence[ecount] = 
                library[ilib].monomersequence[lcount]; 
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name); 
            /* second argument illegible in the source listing; the library context is assumed */ 
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]); 
            epothilone->marked[ecount] = TRUE; 
            if(ecount < (nlargestpiece + largestpiece_ecount - 1)) 
                epothilone->boundarytoright[ecount] = FALSE; 
            fprintf(stdout,"%c",library[ilib].monomersequence[lcount]); 
            lcount++; 
            ecount++; 
        } 

        for(ii=ecount; ii<elen; ii++) { 
            fprintf(stdout," "); 
        } 
        fprintf(stdout," %s\n",library[ilib].name); 
    } 

    return (nlargestpiece); 
}/*maximal_adjacent_alignment_and_dump*/ 
int output_fresh_alignment( 
    LIB *epotemp) 
{ 
    int ecount=0; 
    char *eptr; 
    char boundary[MAXNAMELEN]; 

    eptr = epotemp->monomersequence; 
    ecount = 0; 
    epotemp->nboundary = 0; 
    strcpy(boundary,epotemp->alignedPKSname[ecount]); 
    while(*eptr != '\0') { 
        if(epotemp->boundarytoright[ecount] == TRUE) { 
            epotemp->nboundary++; 
        } 
        ecount++; 
        eptr++; 
    } 

    if(epotemp->nboundary > NBOUNDARY_CUTOFF) return 1; 
    eptr = epotemp->monomersequence; 
    ecount = 0; 
    fprintf(stdout,"HIT "); 
    while(*eptr != '\0') { 
        if(epotemp->alignedPKSname[ecount][0] == '\0') { 
            if(epotemp->boundarytoright[ecount] == TRUE) { 
                fprintf(stdout,"%c:TARG(%s)| ",*eptr,epotemp->context[ecount]); 
            } else { 
                fprintf(stdout,"%c:TARG(%s) ",*eptr,epotemp->context[ecount]); 
            } 
        } else { 
            if(epotemp->boundarytoright[ecount] == TRUE) { 
                fprintf(stdout,"%c:%4s(%s)| ", 
                    *eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]); 
            } else { 
                fprintf(stdout,"%c:%4s(%s) ", 
                    *eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]); 
            } 
        } 
        ecount++; 
        eptr++; 
    } 

    fprintf(stdout,"npiece %d ",epotemp->nboundary); 
    fprintf(stdout,"\n"); 
    return 1; 
}/*output_fresh_alignment*/ 
int get_library( 
    char *libraryfile, 
    LIB *library) 
{ 
    int ii=0,jj=0,kk=0,lcount=0; 
    int nlib=0; 
    char *bufptr,buf[MAXBUF]; 
    char tmonomersequence[MAXNAMELEN]; 
    char *lptr,*tptr; 
    FILE *libp; 

    for(kk=0; kk < MAX_LIB_ENTRIES; kk++) { 
        library[kk].recursion_tagged = FALSE; 
        for(ii=0; ii<MAXNAMELEN; ii++) { 
            library[kk].name[ii] = '\0'; 
            library[kk].monomersequence[ii] = '\0'; 
            library[kk].alignedsequence[ii] = '\0'; 
            for(jj=0; jj<MAXLEN;jj++) { 
                library[kk].alignedPKSname[jj][ii] = '\0'; 
                library[kk].marked[jj] = FALSE; 
                library[kk].context[jj][0] = '\0'; 
                library[kk].context[jj][1] = '\0'; 
                library[kk].context[jj][2] = '\0'; 
                library[kk].context[jj][3] = '\0'; 
            } 
        } 
        library[kk].nboundary = 0; 
    } 

    /* read in the library from PKS.lib */ 
    /* Several character constants in this block are illegible in the source listing. 
       The field delimiter is taken to be a tab because the library file is described 
       as tab-delimited; the blank-line test and the '=' double-bond symbol are assumptions. */ 
    if(NULL == (libp = fopen(libraryfile,"r"))) { 
        fprintf(stdout,"TRY AGAIN; couldnt open %s\n",libraryfile); 
        nlib=0; 
        exit(0); 
    } 

    nlib=0; 
    while(nlib < MAX_LIB_ENTRIES) { 
        if(NULL == fgets(buf,sizeof(buf),libp)) break; 
        bufptr=buf; 
        if(*bufptr == '\n') continue; 
        if(*bufptr != '\0') { 
            lptr = library[nlib].name; 
            while((*bufptr != '\t') && (*bufptr != '\0') && 
                (*bufptr != '\n')){ 
                *lptr++ = *bufptr++; 
            } 
            *lptr = '\0'; 

            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++; 
            lptr = library[nlib].monomersequence; 
            while((*bufptr != '\t') && (*bufptr != '\0') && 
                (*bufptr != '\n')){ 
                /* This code specifically deletes inter-modular 
                   double bonds, optional. */ 
                if(*bufptr != '='){ 
                    *lptr++ = *bufptr++; 
                } else { 
                    bufptr++; 
                } 
            } 
            *lptr = '\0'; 

            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++; 
            lptr = library[nlib].annotatedsequence; 
            while((*bufptr != '\t') && (*bufptr != '\0') && 
                (*bufptr != '\n')){ 
                *lptr++ = *bufptr++; 
            } 
            *lptr = '\0'; 

            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++; 

            fprintf(stdout,"LIBRARY(%d) %s: %s\n", 
                nlib,library[nlib].name,library[nlib].monomersequence); 
            fprintf(stdout,"LIBRARY(%d) %s: %s\n", 
                nlib,library[nlib].name,library[nlib].annotatedsequence); 
            nlib++; 
        } 
    } 

    fclose(libp); 

    for(kk=0; kk< nlib; kk++) { 
        lcount = 0; 
        lptr = library[kk].monomersequence; 
        while(*lptr != '\0'){ 
            if(lcount == 0) { 
                library[kk].context[lcount][0] = '-'; 
                library[kk].context[lcount][1] = *lptr; 
                library[kk].context[lcount][2] = *(lptr+1); 
            } else { 
                if(lcount == (strlen(library[kk].monomersequence) - 1)){ 
                    library[kk].context[lcount][0] = *(lptr-1); 
                    library[kk].context[lcount][1] = *lptr; 
                    /* value illegible in the source listing; '-' is assumed */ 
                    library[kk].context[lcount][2] = '-'; 
                } else { 
                    library[kk].context[lcount][0] = *(lptr-1); 
                    library[kk].context[lcount][1] = *lptr; 
                    library[kk].context[lcount][2] = *(lptr+1); 
                } 
            } 
            library[kk].context[lcount][3] = '\0'; 
            lptr++; 
            lcount++; 
        } 

        fprintf(stdout,"LIBRARY(%d) %s: %s\n", 
            kk,library[kk].name,library[kk].monomersequence); 
        fprintf(stdout,"LIBRARY(%d) %s: %s\n", 
            kk,library[kk].name,library[kk].annotatedsequence); 
        for(jj=0;jj<strlen(library[kk].monomersequence);jj++) { 
            fprintf(stdout,"(%s)",library[kk].context[jj]); 
        } 
        fprintf(stdout,"\n"); 
    } 

    return nlib; 
}/*getJibrary*/ 
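
The field-by-field copying in get_library can be sketched in isolation. The following is a minimal illustration, not the listing's own routine: the names copy_field and parse_library_line are hypothetical, tab separators are assumed from the header comment's description of the library file as tab-delimited, and skipping lines that begin with '#' is an assumption. Unlike the listing, the sketch also bounds every copy by the destination capacity.

```c
#include <assert.h>
#include <stddef.h>

/* Copy one field starting at *cursor into out (capacity cap, including
   the terminating NUL), stopping at a tab, a newline, or the end of the
   string. When skip is not '\0', occurrences of that character are
   dropped instead of copied. (Hypothetical helper.) */
static void copy_field(const char **cursor, char *out, size_t cap, char skip)
{
    size_t n = 0;
    const char *p = *cursor;

    assert(cap > 0);
    while (*p != '\t' && *p != '\n' && *p != '\0') {
        if (!(skip != '\0' && *p == skip) && n + 1 < cap) {
            out[n++] = *p;
        }
        p++;
    }
    out[n] = '\0';
    if (*p == '\t') p++;          /* step over the field separator */
    *cursor = p;
}

/* Parse "name<TAB>plain<TAB>annotated..." into three buffers of size cap.
   When drop_double_bonds is nonzero, '=' is removed from the plain
   sequence. Returns 1 for a record, 0 for a blank or '#' line. */
int parse_library_line(const char *line, int drop_double_bonds,
                       char *name, char *seq, char *annot, size_t cap)
{
    const char *p = line;

    if (*p == '\0' || *p == '\n' || *p == '#') return 0;
    copy_field(&p, name, cap, '\0');
    copy_field(&p, seq, cap, drop_double_bonds ? '=' : '\0');
    copy_field(&p, annot, cap, '\0');
    return 1;
}
```

Called on a made-up record such as "aldga<TAB>BML=GJD<TAB>BMLGJD*" with drop_double_bonds set, the sketch stores "aldga", "BMLGJD", and "BMLGJD*"; with the flag clear, the '=' stays in the plain sequence.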
int dump_STARTER_align(
    LIB epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN]
)
{
    int elen=0;
    int ecount=0, hold_ecount=0;
    int wildcardmatch=FALSE;
    char *sptr, *eptr, *wptr;

    /* line 1: the target sequence itself */
    elen = strlen(epothilone.monomersequence);
    eptr = epothilone.monomersequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr != '\0') {
        fprintf(stdout, "%c", *eptr);
        eptr++;
    }
    fprintf(stdout, "\n");

    /* line 2: ':' for a wildcard match, '|' for an exact match */
    ecount = 0;
    eptr = epothilone.monomersequence;
    sptr = epothilone.alignedsequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr != '\0') {
        wildcardmatch = FALSE;
        wptr = "";
        if(*eptr == 'X') { wptr = wildcards[0]; }
        else if(*eptr == 'Y') { wptr = wildcards[1]; }
        else if(*eptr == 'Z') { wptr = wildcards[2]; }
        while(*wptr != '\0') {
            if(*wptr == *sptr) {
                wildcardmatch = TRUE;
                break;
            }
            wptr++;
        }
        if(wildcardmatch == TRUE) {
            fprintf(stdout, ":");
        } else {
            if(*eptr == *sptr) {
                fprintf(stdout, "|");
            } else {
                fprintf(stdout, " ");
            }
        }
        eptr++;
        sptr++;
    }
    fprintf(stdout, "\n");

    /* line 3: the aligned library monomers and the PKS they came from */
    fprintf(stdout, "ALIGN_TARGET: ");
    eptr = epothilone.monomersequence;
    ecount = 0;
    while(*eptr != '\0') {
        if(epothilone.alignedsequence[ecount] == '\0') {
            fprintf(stdout, " ");
        } else {
            fprintf(stdout, "%c", epothilone.alignedsequence[ecount]);
            hold_ecount = ecount;
        }
        eptr++;
        ecount++;
    }
    fprintf(stdout, " %s\n", epothilone.alignedPKSname[hold_ecount]);
    fprintf(stdout, "STARTER_ALIGN:\n");
}/*dump_STARTER_align*/

reset_epotemp(
    LIB *epotemp,
    LIB epothilone)
{
    int jj=0, elen=0;

    elen = strlen(epothilone.monomersequence);
    strcpy(epotemp->name, epothilone.name);
    strcpy(epotemp->monomersequence, epothilone.monomersequence);
    strcpy(epotemp->alignedsequence, epothilone.alignedsequence);
    epotemp->nboundary = 0;
    for(jj=0; jj<elen; jj++) {
        strcpy(epotemp->alignedPKSname[jj], epothilone.alignedPKSname[jj]);
        epotemp->marked[jj] = epothilone.marked[jj];
        strcpy(epotemp->context[jj], epothilone.context[jj]);
        epotemp->boundarytoright[jj] = epothilone.boundarytoright[jj];
    }
}/*reset_epotemp*/

EXAMPLE 6 

Source Code: 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* ~siani/programs/morph/morph4.c 

PURPOSE: To recursively traverse all the entries in PKS.lib, generating all feasible 
combinations of PKS modules to make the TARGET (e.g., epothilone). 

INPUT: -b number_boundary_cutoff: lets user set the maximum number of
boundaries in output lines. This defaults to 5 (#define NBOUNDARY_CUTOFF 5), which is a
reasonable assumption for something of the length of epothilone (8 modules). However, when
looking at discodermolide, which has 11 modules, a cutoff of 5 sometimes results in too few
output lines; it is too restrictive.

-d allows one to ignore the inter-modular doublebonds in the library file. 

-l libraryfile: tab-delimited CHUCKLES-coded polyketides file with the following
columns

1. polyketide name

2. plain CHUCKLES

3. annotated CHUCKLES (contains information about post-synthetic modifications)

4. source organism 

-n targetname: user-defined name (e.g., epoD) 

-t targetsequence: CHUCKLES-coded polyketide of desired TARGET (e.g., 
MEMLJDGE) 

-w, -x, -y, -z sets of wildcards: sets of monomers for particular positions appearing in 
targetsequence. The wildcards can effectively be used for analoging the TARGET polyketide. 

Hard-coded parameters which may be reset (requires recompiling):
#define NBOUNDARY_CUTOFF 5

NBOUNDARY_CUTOFF determines the maximum number of non-native inter-modular
interfaces which are contained in the output. This is now set to 5, but may be increased
when the user does not care about inefficiencies introduced by these interfaces or when the
targetsequence is very lengthy.

#define RECURSION_COUNTER_CUTOFF 2

RECURSION_COUNTER_CUTOFF specifies the number of levels of recursion
(defaults to 0, 1, 2) acceptable for the run. This limit must be set since the large PKS library can
result in recursion that will combinatorially explode. Because of the multi-directionality of the
alignments (using every library entry as a STARTER), there is no need to go beyond 2 levels of
recursion. However, there may be cases in the future where this number should be increased.
Note that while recursion will eventually terminate without this parameter, runs with a library
over about 20 PKS entries may run for years on a reasonably fast computer.

OUTPUT: All combinations of modules that meet parameters set by user. 

Example output from MEMLJDGE (epothilone D) using subset of PKS.lib. 
Vertical bars indicate non-native inter-modular interfaces. Last column contains the number of 
"pieces" that are needed to put together the PKS. 

Names of PKSs have been abbreviated to fit them in these comments. 
HIT M:3atyl(FMN)| E:tedan(GEH)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:albMl(LME)| L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:3atyl(NGO)| E:albMl(MEJ)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5

USAGE: 

morph3 -l libraryfile -n targetname -t targetsequence [-w W-wildcards] [-x X-wildcards]
[-y Y-wildcards] [-z Z-wildcards] -d

examples:

# generate combinations that yield epothilone D
%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD
%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort
%egrep ALIGN_TARGET omorph3_epoD > omorph3_epoD_STARTER_ALIGN

# generate combinations that yield epothilone D with a C13-hydroxyl
%morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGB -x ABCD > oepoD-13OH
%egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort
%egrep ALIGN_TARGET oepoD-13OH > oepoD-13OH_STARTER_ALIGN

# generate combinations that yield epothilone with the following wildcards (set 1)
%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgB -x ABCD -y LBFIN -z JACGM > oepoD-set1
%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort

# generate combinations that yield epothilone with the following wildcards (set 2)
%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgB -x JK -y EF -z JACGM > oepoD-set2
%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort

LIMITATIONS: 

Current implementation cannot handle intra-modular modifications/splitting 
because morph is operating at the monomer level. Future implementations could convert the 
CHUCKLES-encoded strings into the corresponding and equivalent SMILES and then perform 
more complex chemical analysis of the PKS molecular graphs. Currently, inter-modular double 
bonds are present in the library, but are ignored by the morph program. 

MODIFICATIONS: 

+ added ability to include user-defined wildcards (X, Y, or Z) on the
command line. MAS 05-16-00.
+ added additional wildcard (W). MAS 05-30-00.
+ added additional (summary) column to HIT output list. MAS 05-30-00.
+ added command line argument for suppressing the inter-modular double bonds
in the library. Default is not to treat these as separate modules. MAS 05-31-00.
+ added column that contains the length of the largest matching fragment. MAS 06-05-00.
*/
#define TRUE 1
#define FALSE 0
#define DEBUG_MATCH FALSE
#define DEBUG_STARTER FALSE
#define DEBUG_ALIGN FALSE
#define DEBUG_RECURSE FALSE
#define DEBUG_WILDCARD FALSE

#define MAXLEN 80
#define MAX_TYL_LEN 6
#define MAX_EPO_LEN 6
#define MAXNAMELEN 500
#define MAX_LIB_ENTRIES 500
#define MAXWILD 4
#define MAXBUF 1000

#define NBOUNDARY_CUTOFF 5
#define RECURSION_COUNTER_CUTOFF 2
#define STARTER_MINIMUM_ADJACENT_ALIGN 2
#define MINIMUM_ADJACENT_ALIGN 2

typedef struct _lib {
    char name[MAXNAMELEN];
    char monomersequence[MAXNAMELEN];
    char annotatedsequence[MAXNAMELEN];
    char alignedsequence[MAXNAMELEN];
    char alignedPKSname[MAXLEN][MAXNAMELEN];
    int boundarytoright[MAXNAMELEN];
    int marked[MAXLEN];
    char context[MAXLEN][4];
    int recursion_tagged;
    int nboundary;
} LIB;

main(int argc, char **argv)
{ 

    int ii=0, jj=0, kk=0, ll=0;
    int nlib=0;
    int ecount=0;
    int nfilled=0, nfilledmax=0;
    int epothilonelen=0;
    int nlargestpiece=0, tnlargestpiece=0;
    int mmpass=0;
    int lcount=0;
    int new_unmarked_entries_filled=0;
    int recursion_counter = 0;
    int nwildcard=0;
    int best_new_unmarked_entries_filled = 0;
    int smallest_acceptable_piece = 0;
    int current_nmarked=0, previous_nmarked=0;
    int inter_modular_db_flag_off = FALSE;
    int nboundary_cutoff=NBOUNDARY_CUTOFF;
    char *sptr, *eptr, *lptr, *bufptr;
    char *clibptr;
    char *libraryfile;
    char *targetsequence, *targetname;
    char buf[MAXBUF];
    char wildcards[MAXWILD][MAXLEN];
    FILE *libp;
    LIB epotemp;
    LIB library[MAX_LIB_ENTRIES];
    LIB epothilone;

    char *progname;
    char **filelist, **fileptr;

    libraryfile = "";
    targetsequence = "";
    targetname = "";

    for(ii=0; ii<MAXWILD; ii++) {
        for(jj=0; jj<MAXLEN; jj++) {
            wildcards[ii][jj] = '\0';
        }
    }

    /* process arguments */
    filelist = fileptr = (char **)(malloc(argc * sizeof(*argv)));

    progname = *argv++;
    if(argc < 2) {
        fprintf(stderr, "usage:%s [-b nboundary_cutoff] [-d] -l libraryfile -n targetname -t targetsequence [-w W-wildcards] [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]\n", progname);
        exit(0);
    }

    while(argc-- > 1) {
        if(argv[0][0] == '-' && argv[0][1] != '\0') {
            /* handle option */
            ++(*argv); /* advance past the minus */
            switch(**argv) {
            case 'b': /* get number of boundaries cutoff for output of alignments */
                argv++; argc--;
                sscanf(argv[0], "%d", &nboundary_cutoff);
                fprintf(stderr, "-b: nboundary_cutoff=%d\n", nboundary_cutoff);
                break;
            case 'd': /* ignore inter-modular double bonds in the library file */
                inter_modular_db_flag_off = TRUE;


                fprintf(stderr, "-d: inter-modular double bonds ignored.\n");
                break;
            case 'l': /* get library input filename (PKS.lib) */
                argv++; argc--;
                libraryfile = argv[0];
                fprintf(stderr, "-l: libraryfile=%s\n", libraryfile);
                break;
            case 'n': /* get target name string */
                argv++; argc--;
                targetname = argv[0];
                fprintf(stderr, "-n: targetname=%s\n", targetname);
                break;
            case 't': /* get target sequence string */
                argv++; argc--;
                targetsequence = argv[0];
                fprintf(stderr, "-t: targetsequence=%s\n", targetsequence);
                break;
            case 'w': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0], argv[0]);
                fprintf(stderr, "-w: wildcards[%d]=%s\n", 0, wildcards[0]);
                nwildcard++;
                break;
            case 'x': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1], argv[0]);
                fprintf(stderr, "-x: wildcards[%d]=%s\n", 1, wildcards[1]);
                nwildcard++;
                break;
            case 'y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2], argv[0]);
                fprintf(stderr, "-y: wildcards[%d]=%s\n", 2, wildcards[2]);
                nwildcard++;
                break;
            case 'z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[3], argv[0]);
                fprintf(stderr, "-z: wildcards[%d]=%s\n", 3, wildcards[3]);
                nwildcard++;
                break;
            case 'W': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0], argv[0]);
                fprintf(stderr, "-w: wildcards[%d]=%s\n", 0, wildcards[0]);
                nwildcard++;
                break;
            case 'X': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1], argv[0]);
                fprintf(stderr, "-x: wildcards[%d]=%s\n", 1, wildcards[1]);
                nwildcard++;
                break;
            case 'Y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2], argv[0]);
                fprintf(stderr, "-y: wildcards[%d]=%s\n", 2, wildcards[2]);
                nwildcard++;
                break;
            case 'Z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[3], argv[0]);
                fprintf(stderr, "-z: wildcards[%d]=%s\n", 3, wildcards[3]);
                nwildcard++;
                break;
            default:
                fprintf(stderr, "%s: unknown option; ignored\n", *argv);
            }/*switch*/
        } else { /* a regular filename */
            *fileptr++ = *argv;
            *fileptr = NULL;
        }
        argv++;
    }/*while*/



    if(nwildcard > 0) {
        for(ii=0; ii<nwildcard; ii++) {
iprintf(stderr,Vildcar^ 
fpriatf[stdouVVM^ 

} 

} 

    epothilone.nboundary = 0;
    for(ii=0; ii<MAXNAMELEN; ii++) {
        epothilone.name[ii] = '\0';
        epothilone.monomersequence[ii] = '\0';
        epothilone.alignedsequence[ii] = '\0';
        epothilone.boundarytoright[ii] = TRUE;
        for(jj=0; jj<MAXLEN; jj++) {
            epothilone.alignedPKSname[jj][ii] = '\0';
            epothilone.marked[jj] = FALSE;
            epothilone.context[jj][0] = '\0';
            epothilone.context[jj][1] = '\0';
            epothilone.context[jj][2] = '\0';
        }
    }

    strcpy(epothilone.name, targetname);
    strcpy(epothilone.monomersequence, targetsequence);
    fprintf(stdout, "TARGET: %s\n", epothilone.monomersequence);

    /* build the (previous, current, next) context of each target monomer */
    ecount = 0;
    eptr = epothilone.monomersequence;
    while(*eptr != '\0') {
        if(ecount == 0) {
            epothilone.context[ecount][0] = '-';
            epothilone.context[ecount][1] = *eptr;
            epothilone.context[ecount][2] = *(eptr + 1);
        } else {
            if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                epothilone.context[ecount][0] = *(eptr - 1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = '-';
            } else {
                epothilone.context[ecount][0] = *(eptr - 1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            }
        }
        epothilone.context[ecount][3] = '\0';
        eptr++;
        ecount++;
    }

    for(ii=0; ii<ecount; ii++) {
        fprintf(stdout, "(%s)", epothilone.context[ii]);
    }



    /* library */
    nlib = get_library(libraryfile, library, inter_modular_db_flag_off);
    fprintf(stdout, "nlib=%d\n", nlib);

    kk=0; while(kk<nlib) {
        /* zero out the epothilone entry with respect to a new alignment */
        for(ii=0; ii<MAXNAMELEN; ii++) {
            epothilone.alignedsequence[ii] = '\0';
            epothilone.boundarytoright[ii] = TRUE;
            for(jj=0; jj<MAXLEN; jj++) {
                epothilone.alignedPKSname[jj][ii] = '\0';
                epothilone.marked[jj] = FALSE;
            }
        }

        /* reset the context back to that in epothilone */
        ecount = 0;
        eptr = epothilone.monomersequence;
        while(*eptr != '\0') {
            if(ecount == 0) {
                epothilone.context[ecount][0] = '-';
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            } else {
                if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                    epothilone.context[ecount][0] = *(eptr - 1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = '-';
                } else {
                    epothilone.context[ecount][0] = *(eptr - 1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = *(eptr + 1);
                }
            }
            epothilone.context[ecount][3] = '\0';
            eptr++;
            ecount++;
        }



        /* align STARTER (current library entry) and epothilone */






        sptr = library[kk].monomersequence;
        lcount = 0;
        while(*sptr != '\0') {

fprintf(stdouVTibrary^ 

ldc,l<x>mt,hT>rary|^]j^^ 

sptr++;
lcount++; 

} 

        /* Call maximal_adjacent_alignment until it no longer returns more than
           two adjacent modules. There is really no reason to try to extract
           individual modules because this will be done as part of the
           recursive filling of spaces from the library.
        */
        smallest_acceptable_piece = 2;
        eptr = epothilone.monomersequence;
        fprintf(stdout, "ALIGN_TARGET: ");
        while(*eptr != '\0') {
            fprintf(stdout, "%c", *eptr);
            eptr++;
        }
        fprintf(stdout, "\n");
        fprintf(stderr, "aligning %d %s\n", kk, library[kk].name);

        best_new_unmarked_entries_filled = 0;
        while((new_unmarked_entries_filled =
               maximal_adjacent_alignment_and_dump(&epothilone, nwildcard, wildcards,
                   library, kk, smallest_acceptable_piece))
              >= STARTER_MINIMUM_ADJACENT_ALIGN) {
            if(best_new_unmarked_entries_filled < new_unmarked_entries_filled) {
                best_new_unmarked_entries_filled =
                    new_unmarked_entries_filled;
            }
            if(DEBUG_STARTER) fprintf(stdout,
                "STARTER ALIGN: new_unmarked_entries_filled=%d\n",
                new_unmarked_entries_filled);
            epothilonelen = strlen(epothilone.monomersequence);
            for(ii=0; ii<epothilonelen; ii++) {
                if(DEBUG_STARTER) fprintf(stdout,
                    "STARTER ALIGN:found a best alignment between epo.monomer[%d]=%c in library[%d].name=%s\n",
                    ii, epothilone.monomersequence[ii], kk, library[kk].name);
            }
        }

        library[kk].recursion_tagged = TRUE;
        fprintf(stdout, "ALIGN_TARGET: \n");
        dump_STARTER_align(epothilone, nwildcard, wildcards);
        fprintf(stdout, "ALIGN_TARGET: \n");

        if(best_new_unmarked_entries_filled <= 1) {
            fprintf(stdout,
                "ALIGN_TARGET: PROBLEM best_new_unmarked_entries_filled = %d\n",
                best_new_unmarked_entries_filled);
            fprintf(stdout,
                "ALIGN_TARGET: PROBLEM skipping this STARTER entry for library[%d].name=%s\n",
                kk, library[kk].name);
            library[kk].recursion_tagged = FALSE;
            kk++;
            continue;
        }


        /* fill in the gaps from the library */

        /* generate a fresh copy of epothilone in epotemp */
        epothilonelen = strlen(epothilone.monomersequence);
        nfilledmax = strlen(epothilone.monomersequence);
        fprintf(stdout, "nfilledmax=%d\n", nfilledmax);
        reset_epotemp(&epotemp, epothilone);
        nfilled = 0;
        for(ii=0; ii<epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
        if(DEBUG_STARTER) fprintf(stdout, "nfilled from STARTER=%d\n", nfilled);

        for(mmpass = 0; mmpass < nlib; mmpass++) {
            if(mmpass == kk) { continue; }
            reset_epotemp(&epotemp, epothilone);

            nfilled = 0;
            for(ii=0; ii<epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }

            if(nfilled >= nfilledmax) {
                output_fresh_alignment(&epotemp, nboundary_cutoff);
            } else {
                current_nmarked = nfilled;
                previous_nmarked = nfilled;
                smallest_acceptable_piece = 2;
                while((new_unmarked_entries_filled =
                       maximal_adjacent_alignment(&epotemp, nwildcard, wildcards,
                           library, mmpass, smallest_acceptable_piece))
                      >= MINIMUM_ADJACENT_ALIGN) {
                    current_nmarked += new_unmarked_entries_filled;
                    if(DEBUG_MATCH) fprintf(stdout,
                        "main: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                        recursion_counter, mmpass, previous_nmarked,
                        current_nmarked);
                }

                nfilled = 0;
                for(ii=0; ii<epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
                if(nfilled >= nfilledmax) {
                    output_fresh_alignment(&epotemp, nboundary_cutoff);
                    continue; /* no need to recurse */
                }

                if(DEBUG_MATCH) fprintf(stdout, "main: about to RECURSE: mmpass=%d\n", mmpass);

                library[mmpass].recursion_tagged = TRUE;
                recursion_counter++;
                recurse_through_the_library(nfilledmax, epotemp, &epothilone,
                    nwildcard, wildcards, nlib, library,
                    &recursion_counter, nboundary_cutoff);
                library[mmpass].recursion_tagged = FALSE;
                recursion_counter--;
            }
        }

        library[kk].recursion_tagged = FALSE;
        kk++;
    }/*nlib*/

}/*main*/
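
The three-character context that main builds for each target monomer (previous monomer, the monomer itself, next monomer, with '-' at a chain end) can be written as one small helper. This is a sketch only: build_context is a hypothetical name that does not appear in the listing, and for a one-monomer sequence it writes '-' on both sides, whereas the listing's first-position branch copies whatever character follows.

```c
#include <assert.h>
#include <string.h>

/* Fill ctx[i] with the three-character neighbourhood of seq[i]:
   previous monomer, the monomer itself, next monomer. A '-' stands in
   for the missing neighbour at either end of the chain. The caller
   must supply at least strlen(seq) rows. (Hypothetical helper.) */
void build_context(const char *seq, char ctx[][4])
{
    size_t len, i;

    assert(seq != NULL && ctx != NULL);
    len = strlen(seq);
    for (i = 0; i < len; i++) {
        ctx[i][0] = (i == 0) ? '-' : seq[i - 1];
        ctx[i][1] = seq[i];
        ctx[i][2] = (i + 1 == len) ? '-' : seq[i + 1];
        ctx[i][3] = '\0';
    }
}
```

For the target MEMLJDGE used in the header comment, the first, second, and last rows come out as "-ME", "MEM", and "GE-". These are the same kind of triples that appear in parentheses in the HIT lines.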



/*
PURPOSE:
INPUT:
OUTPUT:
PROCEDURE:
*/

int recurse_through_the_library(
    int nfilledmax,
    LIB epotemp,
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    int nlib,
    LIB *library,
    int *recursion_counter,
    int nboundary_cutoff)
{
    int ii=0;
    int ecount=0, elen=0;
    int mmpass=0;
    int nfilled=0;
    int lcount=0;
    int previous_nmarked=0, current_nmarked=0;
    int smallest_acceptable_piece = 0;
    char *eptr;
    char *clibptr;
    char boundary[MAXNAMELEN];
    int new_unmarked_entries_filled=0;
    LIB epotemp_temp;

    if(DEBUG_MATCH) fprintf(stdout, "RECURSE: recursion_counter=%d, nlib=%d\n",
                            *recursion_counter, nlib);

    elen = strlen(epotemp.monomersequence);
    nfilled = 0;
    for(ii=0; ii<elen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
    previous_nmarked = nfilled;
    current_nmarked = nfilled;
    smallest_acceptable_piece = 1;
    if(nfilled >= nfilledmax) { return 1; }


    for(mmpass = 0; mmpass < nlib; mmpass++) {
        if(*recursion_counter >= RECURSION_COUNTER_CUTOFF) {
            return 1;
        }

        if(DEBUG_MATCH) fprintf(stdout, "RECURSE: recursion_counter=%d, mmpass=%d\n",
                                *recursion_counter, mmpass);

        /* entries already used higher up in the recursion are skipped */
        if(library[mmpass].recursion_tagged == TRUE) {
            if(DEBUG_MATCH) fprintf(stdout,
                "RECURSE: library[%d].recursion_tagged=TRUE; skipping\n", mmpass);
            continue;
        }

        reset_epotemp(&epotemp_temp, epotemp);

        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii<elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        previous_nmarked = nfilled;
        current_nmarked = nfilled;

        while((new_unmarked_entries_filled =
               maximal_adjacent_alignment(&epotemp_temp, nwildcard, wildcards, library,
                   mmpass, smallest_acceptable_piece)) >= 1) {
            current_nmarked += new_unmarked_entries_filled;
            if(DEBUG_MATCH) fprintf(stdout,
                "RECURSE: recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                *recursion_counter, mmpass, previous_nmarked,
                current_nmarked);
        }

        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii<elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        if(nfilled >= nfilledmax) {
            output_fresh_alignment(&epotemp_temp, nboundary_cutoff);
            continue;
        }

        library[mmpass].recursion_tagged = TRUE;
        (*recursion_counter)++;
        recurse_through_the_library(nfilledmax, epotemp_temp, epothilone,
            nwildcard, wildcards, nlib, library,
            recursion_counter, nboundary_cutoff);
        library[mmpass].recursion_tagged = FALSE;
        (*recursion_counter)--;
    }/* mmpass */

}/*recurse_through_the_library*/
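
To see why the header comment insists on a recursion cutoff, it helps to count how many alignment attempts a tag-and-recurse traversal makes. The sketch below is a simplified model, not the listing's routine: it assumes every attempt leaves gaps (so the traversal always descends), it starts counting at level 0 without a separate STARTER pass, and the names visit, count_alignment_attempts, and MAX_ENTRIES are hypothetical.

```c
#include <assert.h>

#define MAX_ENTRIES 64   /* hypothetical capacity for this sketch */

/* Try every untagged library entry at this level, tag it, descend one
   level, then release the tag, mirroring the recursion_tagged flag. */
static long visit(int nlib, int tagged[], int level, int cutoff)
{
    long visits = 0;
    int i;

    if (level >= cutoff) return 0;        /* depth cutoff reached */
    for (i = 0; i < nlib; i++) {
        if (tagged[i]) continue;          /* already used higher up */
        visits++;                         /* one alignment attempt */
        tagged[i] = 1;
        visits += visit(nlib, tagged, level + 1, cutoff);
        tagged[i] = 0;                    /* release on the way back */
    }
    return visits;
}

/* Number of alignment attempts for nlib entries and cutoff levels. */
long count_alignment_attempts(int nlib, int cutoff)
{
    int tagged[MAX_ENTRIES] = {0};

    assert(nlib >= 0 && nlib <= MAX_ENTRIES);
    return visit(nlib, tagged, 0, cutoff);
}
```

Under these assumptions the count is n + n(n-1) + n(n-1)(n-2) + ... with one term per level, so each additional level multiplies the work by roughly the library size. That growth is the combinatorial explosion the cutoff guards against.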

/*
PURPOSE:
INPUT:
OUTPUT:
returns the size of the largest maximal adjacent set of monomers inserted.
PROCEDURE:
*/

int maximal_adjacent_alignment(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0, jj=0, kk=0;
    int ecount=0, lcount=0;
    int epothilonelen=0;
    int nlargestpiece=0, tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr, *largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0, largestpiece_ecount=0;
    char *sptr, *eptr, *lptr, *bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard > 0) {
            fprintf(stdout, "maximal_: wildcards[0]=%s\n", wildcards[0]);
        }
    }

    if(DEBUG_ALIGN) fprintf(stdout,
        "maximal_adjacent_alignment: smallest_acceptable_piece=%d\n",
        smallest_acceptable_piece);

    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    ecount = 0;
    lcount = 0;
    nlargestpiece = 0;
    tnlargestpiece = 0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;


    while(*eptr != '\0') {
        sptr = library[ilib].monomersequence;
        lcount = 0;
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;
        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'W') { wptr = wildcards[0]; }
                else if(*eptr == 'X') { wptr = wildcards[1]; }
                else if(*eptr == 'Y') { wptr = wildcards[2]; }
                else if(*eptr == 'Z') { wptr = wildcards[3]; }

                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }

                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout,
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece, ecount, *eptr,
                        ilib, library[ilib].name, lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN) fprintf(stdout,
                            "FOUND a largestpiece: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                            nlargestpiece, largestpiece_ecount,
                            *largestpiece_eptr, ilib, library[ilib].name,
                            largestpiece_lcount, *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    /*NEW*/
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }
        tnlargestpiece = 0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
    }

    if(DEBUG_ALIGN) fprintf(stdout, "ALIGN: largest piece match is %d monomers from %s\n",
                            nlargestpiece, library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout, "ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
                            largestpiece_ecount, largestpiece_lcount);

    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout, "ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;
        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount], library[ilib].name);
            strcpy(epothilone->context[ecount], library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            /* positions inside the piece have a native interface to their right */
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            lcount++;
            ecount++;
        }
    }

    return (nlargestpiece);
}/*maximal_adjacent_alignment*/
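
The per-position test inside the alignment loop (exact match, or membership in the set bound to a W, X, Y, or Z wildcard) can be isolated as a predicate. This is an illustrative sketch: monomer_matches, NWILD, and WILDLEN are hypothetical names, and the mapping W, X, Y, Z to sets 0 through 3 follows the order used in maximal_adjacent_alignment above.

```c
#include <assert.h>
#include <string.h>

#define NWILD 4      /* W, X, Y, Z (assumed order, as in the listing) */
#define WILDLEN 80   /* hypothetical capacity of one wildcard set */

/* Return 1 when library monomer lib can occupy target position target:
   either the letters are identical, or target is a wildcard letter whose
   set contains lib. Return 0 otherwise. (Hypothetical helper.) */
int monomer_matches(char target, char lib, char wildcards[NWILD][WILDLEN])
{
    const char *letters = "WXYZ";
    const char *hit;

    assert(wildcards != NULL);
    if (target == '\0' || lib == '\0') return 0;
    if (target == lib) return 1;
    hit = strchr(letters, target);
    if (hit == NULL) return 0;            /* not a wildcard letter */
    return strchr(wildcards[hit - letters], lib) != NULL;
}
```

With -x ABCD, as in the C13-hydroxyl example in the header comment, a target position written as X accepts a library monomer A, B, C, or D and rejects E.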

/*
PURPOSE:
INPUT:
OUTPUT:
returns the size of the largest maximal adjacent set of monomers inserted.
PROCEDURE:
*/

int maximal_adjacent_alignment_and_dump(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0, jj=0, kk=0;
    int ecount=0, lcount=0;
    int elen=0;
    int epothilonelen=0;
    int nlargestpiece=0, tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr, *largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0, largestpiece_ecount=0;
    char *sptr, *eptr, *lptr, *bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard > 0) {
            fprintf(stdout, "maximal_: wildcards[0]=%s\n", wildcards[0]);
        }
    }

    fprintf(stdout, "maximal_adjacent_alignment_and_dump: smallest_acceptable_piece=%d\n",
            smallest_acceptable_piece);

    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    elen = strlen(epothilone->monomersequence);
    ecount = 0;
    lcount = 0;
    nlargestpiece = 0;
    tnlargestpiece = 0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;


    while(*eptr != '\0') {
        sptr = library[ilib].monomersequence;
        lcount = 0;
        /*NEW*/
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;

        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'W') { wptr = wildcards[0]; }
                else if(*eptr == 'X') { wptr = wildcards[1]; }
                else if(*eptr == 'Y') { wptr = wildcards[2]; }
                else if(*eptr == 'Z') { wptr = wildcards[3]; }

                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }

                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout,
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece, ecount, *eptr,
                        ilib, library[ilib].name, lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN) fprintf(stdout,
                            "FOUND a largest piece: len=%d, epo(%d, %c), lib(%d, %c)\n",
                            nlargestpiece, largestpiece_ecount,
                            *largestpiece_eptr, largestpiece_lcount, *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    /*NEW*/
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }

        tnlargestpiece = 0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
        if(DEBUG_ALIGN) {
            fprintf(stdout, "incrementing hold_this_place_eptr=%c, hold_this_ecount=%d\n",
                    *hold_this_place_eptr, hold_this_ecount);
        }
    }

    if(DEBUG_ALIGN) fprintf(stdout, "ALIGN: largest piece match is %d monomers from %s\n",
                            nlargestpiece, library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout, "ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
                            largestpiece_ecount, largestpiece_lcount);

    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout, "ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;

        /* pad so the aligned piece prints under its target positions */
        fprintf(stdout, "ALIGN_TARGET: ");
        for(ii=0; ii<largestpiece_ecount; ii++) {
            fprintf(stdout, " ");
        }

        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount], library[ilib].name);
            strcpy(epothilone->context[ecount], library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            fprintf(stdout, "%c", library[ilib].monomersequence[lcount]);
            lcount++;
            ecount++;
        }

        for(ii=ecount; ii<elen; ii++) {
            fprintf(stdout, " ");
        }
        fprintf(stdout, " %s\n", library[ilib].name);
    }

    return (nlargestpiece);
}/*maximal_adjacent_alignment_and_dump*/
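
Stripped of wildcards and bookkeeping, the core of both alignment routines is a search for the longest run of consecutive monomers shared by the target and one library entry, restricted to target positions that are still unmarked. The brute-force sketch below illustrates that idea. It is not a transcription of the listing's scan, which walks the two strings with saved restart positions. The name longest_adjacent_run and the example library string are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Find the longest run of consecutive monomers shared by target and lib,
   using only target positions whose marked[] flag is 0. When the result
   is positive, the start offsets are stored through tstart and lstart
   (first longest run in scan order). Returns the run length. */
int longest_adjacent_run(const char *target, const int *marked,
                         const char *lib, int *tstart, int *lstart)
{
    int tlen, llen, t, l, k, best = 0;

    assert(target && marked && lib && tstart && lstart);
    tlen = (int)strlen(target);
    llen = (int)strlen(lib);
    for (t = 0; t < tlen; t++) {
        for (l = 0; l < llen; l++) {
            k = 0;
            while (t + k < tlen && l + k < llen &&
                   !marked[t + k] && target[t + k] == lib[l + k]) {
                k++;
            }
            if (k > best) {
                best = k;
                *tstart = t;
                *lstart = l;
            }
        }
    }
    return best;
}
```

Calling this repeatedly, marking the positions of each returned run before the next call, mimics the way the callers loop until the returned length drops below the minimum acceptable piece. For target MEMLJDGE and a made-up library string BMLGJDL, the first call finds "ML" and, once those two positions are marked, the second finds "JD".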



/*
PURPOSE:
INPUT:
OUTPUT:
PROCEDURE:
*/

int output_fresh_alignment(
    LIB *epotemp,
    int nboundary_cutoff)
{
    int acount=0, ecount=0;
    int longest_segmentlen=0, current_segmentlen=0;
    char *aptr, *eptr;
    char boundary[MAXNAMELEN];

    /* count the non-native interfaces and the longest uninterrupted segment */
    eptr = epotemp->monomersequence;
    ecount = 0;
    epotemp->nboundary = 0;
    longest_segmentlen = 0;
    current_segmentlen = 0;
    strcpy(boundary, epotemp->alignedPKSname[ecount]);
    while(*eptr != '\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
            epotemp->nboundary++;
            if(current_segmentlen > longest_segmentlen) {
                longest_segmentlen = current_segmentlen;
            }
            current_segmentlen = 0;
        }
        current_segmentlen++;
        ecount++;
        eptr++;
    }
    if(current_segmentlen > longest_segmentlen) {
        longest_segmentlen = current_segmentlen;
    }

    /* too many non-native interfaces: suppress this combination */
    if(epotemp->nboundary > nboundary_cutoff) return 1;

    eptr = epotemp->monomersequence;
    ecount = 0;
    fprintf(stdout, "HIT ");
    while(*eptr != '\0') {
        if(epotemp->alignedPKSname[ecount][0] == '\0') {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout, "%c:TARG(%s)| ", *eptr,
                        epotemp->context[ecount]);
            } else {
                fprintf(stdout, "%c:TARG(%s) ", *eptr,
                        epotemp->context[ecount]);
            }
        } else {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout, "%c:%4s(%s)| ", *eptr,
                        epotemp->alignedPKSname[ecount], epotemp->context[ecount]);
            } else {
                fprintf(stdout, "%c:%4s(%s) ", *eptr,
                        epotemp->alignedPKSname[ecount], epotemp->context[ecount]);
            }
        }
        ecount++;
        eptr++;
    }

    fprintf(stdout, "%d %d", epotemp->nboundary, longest_segmentlen);
    fprintf(stdout, "\t");

    eptr = epotemp->monomersequence;
    acount = 0;
    ecount = 0;
    while(*eptr != '\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
fpriiitf(stdcm^ 

} else { 

fj^tf(stdout/ , %^q>ot^ 

} 

        eptr++;
        acount++;
        ecount++;
    }

    fprintf(stdout, "\n");
    return 1;

}/*output_fresh_alignment*/

/* 

*/ 

int get_library(

char *libraryfile,

LIB *library,

int inter_modular_db_flag_off)
{
int ii=0,jj=0,kk=0,lcount=0;

int nlib=0;

char *bufptr,buf[MAXBUF];

char tmonomersequence[MAXNAMELEN];

char *lptr,*tptr;

FILE *libp;

for(kk=0; kk < MAX_LIB_ENTRIES; kk++) {
library[kk].recursion_flagged = FALSE;
for(ii=0; ii<MAXNAMELEN; ii++) {

library[kk].name[ii] = '\0';

library[kk].monomersequence[ii] = '\0';

library[kk].alignedsequence[ii] = '\0';

foKij^;jj<MAXI^;jb"-H-) { 

library[kk].alignedPKSname[jj][ii] = '\0';
library[kk].marked[jj] = FALSE;
library[kk].context[jj][0] = '\0';
library[kk].context[jj][1] = '\0';
library[kk].context[jj][2] = '\0';
library[kk].context[jj][3] = '\0';

}

} 

library[kk].nboundary = 0; 

} 

/* read in the library from PKS.lib */

if((libp = fopen(libraryfile,"r")) == NULL) {

fprintf(stdout,"TRY AGAIN; couldn't open library file: %s\n",libraryfile);

nlib=0;

exit(0);

}

nlib=0; 

while(nlib < MAX_LIB_ENTRIES) { 
if(NUIJLF^g^
bufptr = buf; 

ifl^bufptr = W) continue; 
if[*bufptr !=W){ 

/*

sscanf(bufptr,"%s %s %s",library[nlib].name,tmonomersequence,library[nlib].annotatedsequence);

*/

lptr = library[nlib].name;

while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')){
*lptr++ = *bufptr++;

}

*lptr='\0';

if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;



lptr = library[nlib].monomersequence;

while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')){
/*

This code specifically deletes inter-modular double bonds when the -d

option is set

*/

if(inter_modular_db_flag_off == TRUE){
if(*bufptr != '='){

*lptr++ = *bufptr++;

}else{

bufptr++;

}

}else{

*lptr++ = *bufptr++;

}

}

*lptr='\0';

if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;

lptr = library[nlib].annotatedsequence;
while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')){

*lptr++ = *bufptr++;

}

if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;

Q>rintf(stdout,"LIBRARY(%d) %s: 
%s\n"^b4ibrary[iilib].nameJ^brary[nlib].monomersequence); 

Q>rintfi[stdout, ,, LIBRARy(%d) %s: 
%s\n"^b,library[n]ib]jiame,libr^ 



nlib++; 

} 

} 

fclose(libp);

for(kk=0; kk<nlib; kk++) {

lcount = 0;

lptr = library[kk].monomersequence;
while(*lptr != '\0'){

if(lcount == 0) {

library[kk].context[lcount][0] =

library[kk].context[lcount][1] = *lptr;

library[kk].context[lcount][2] = *(lptr + 1);

} else {

if(lcount == (strlen(library[kk].monomersequence) - 1)){
library[kk].context[lcount][0] = *(lptr - 1);
library[kk].context[lcount][1] = *lptr;
library[kk].context[lcount][2] =

}else{

library[kk].context[lcount][0] = *(lptr - 1);
library[kk].context[lcount][1] = *lptr;
library[kk].context[lcount][2] = *(lptr + 1);

}

}

library[kk].context[lcount][3] = '\0';

lptr++;

lcount++;

}

^)rintfl[stdout, ,r LlBRARY(%d) %s: 
%s\n"^Ubrary[kk]jiame4ibrary[ldc]jnono^ 

^rintf(stdout, n LIBRARY(%d) %s: 
%s\n"Jkk,M>imylTdc]^ 

for(jj^;jj<strlen^^ 

fprmt^stdouVX^ 

} 

$Iintf(stdout, ,l \n f, ); 



return nlib; 
}/*get_library*/
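
Two steps of get_library lend themselves to a compact illustration: dropping the '=' characters that mark inter-modular double bonds, and recording for each monomer a three-character context made of its left neighbor, itself, and its right neighbor. The sketch below is written under stated assumptions: the names strip_double_bonds, build_context, CONTEXT_LEN, and PAD_CHAR are hypothetical, and the padding character used at the two ends of the sequence is an assumption because the corresponding characters are illegible in the listing.

```c
#include <assert.h>
#include <string.h>

#define CONTEXT_LEN 4   /* three context characters plus the terminator */
#define PAD_CHAR '-'    /* assumption: the listing's end padding is illegible */

/* Hypothetical helper: copies src to dst while skipping every '=',
   the character the listing removes when the -d option is set.
   dst must have room for strlen(src) + 1 bytes. */
void strip_double_bonds(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (*src != '=') *dst++ = *src;
        src++;
    }
    *dst = '\0';
}

/* Hypothetical helper: fills context[i] with the previous, current,
   and next monomer symbols of seq, padding both ends of the sequence
   with PAD_CHAR. context must have at least strlen(seq) rows. */
void build_context(const char *seq, char context[][CONTEXT_LEN])
{
    size_t len, i;

    assert(seq != NULL && context != NULL);
    len = strlen(seq);
    for (i = 0; i < len; i++) {
        context[i][0] = (i == 0) ? PAD_CHAR : seq[i - 1];
        context[i][1] = seq[i];
        context[i][2] = (i + 1 == len) ? PAD_CHAR : seq[i + 1];
        context[i][3] = '\0';
    }
}
```

For the sequence ADG, this sketch yields the contexts -AD, ADG, and DG-, which is the per-position information that output_fresh_alignment prints in parentheses after each monomer.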



/*

*/ 

int dump_STARTER_align( 

LIB epothilone, 

int nwildcard, 

char wildcards[MAXWILD][MA

) 

{ 

int elen=0;

int ecount=0,hold_ecount=0;

int wildcardmatch=FALSE;

char *sptr,*eptr,*wptr;



elen = strlen(epothilone.monomersequence);
eptr = epothilone.monomersequence;
fprintf(stdout, "ALIGN_TARGET: ");
while(*eptr!='\0'){

fprintf(stdout,"%c",*eptr);

eptr++;

}

fprintf(stdout,"\n");
ecount=0;

eptr = epothilone.monomersequence;
sptr = epothilone.alignedsequence;
fprintf(stdout, "ALIGN_TARGET: ");
while(*eptr!='\0'){

wildcardmatch = FALSE;

wptr = "";

if(*eptr == 'X') { wptr = wildcards[0]; }
else if(*eptr == 'Y') { wptr = wildcards[1]; }
else if(*eptr == 'Z') { wptr = wildcards[2]; }

while(*wptr != '\0'){

if(*wptr==*sptr){

wildcardmatch = TRUE;
break;

}

wptr++;

}

if(wildcardmatch == TRUE) {
}else{

if(*eptr == *sptr){

$rintf[stdout,T); 

} else { 

fprintf[stdout, w B ); 

} 

} 
eptr++;
sptr++;

} 

fprintf(stdout,^ w ); 
fprintf(stdout, "AUGNTARGET: 
eptr = epothilone.monomersequence;
ecount=0;

while(*eptr!='\0'){

if(epothilone.alignedsequence[ecount] = V) 1 ) { 
$rintf(stdout, ,, "); 

} else { 

lprintf(stdouV f %c^epoM 
hold_ecount = ecount; 

eptr++;
ecount++;

}

fprintf(stdout," %s\n",epothilone.alignedPKSname[hold_ecount]);
fprintf(stdout,"STARTER_ALIGN:\n");

}/*dump_STARTER_align*/ 
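
The wildcard test inside dump_STARTER_align, in which the target symbols X, Y, and Z each stand for a set of permitted monomer symbols, can be sketched as a single predicate. This is an illustration under assumptions: the function name symbol_matches and its signature are hypothetical, and unlike the listing, which handles the wildcard case and the literal case in separate branches, the sketch folds both into one return value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the original listing).
   Returns 1 when target symbol t accepts aligned symbol s: either t is
   one of the wildcards 'X', 'Y', 'Z' and s is listed in the matching
   wildcard string, or t and s are the same symbol. Returns 0 otherwise. */
int symbol_matches(char t, char s, const char *const wildcards[3])
{
    const char *w = "";   /* non-wildcard symbols get an empty set */

    assert(wildcards != NULL);
    if (t == 'X') w = wildcards[0];
    else if (t == 'Y') w = wildcards[1];
    else if (t == 'Z') w = wildcards[2];

    for (; *w != '\0'; w++) {
        if (*w == s) return 1;
    }
    return t == s;
}
```

With wildcard sets "AB", "DE", and "" (chosen purely for illustration), the symbol X accepts A or B, Y accepts D or E, and Z accepts nothing beyond a literal Z.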

/* 

*/ 

int reset_epotemp( 
LIB *epotemp, 
LIB epothilone) 
{ 

int jj=0, elen=0;

elen = strlen(epothilone.monomersequence);

strcpy(epotemp->name,epothilone.name);

strcpy(epotemp->monomersequence,epothilone.monomersequence);

strcpy(epotemp->alignedsequence,epothilone.alignedsequence);
epotemp->nboundary = 0;
for(jj=0; jj< elen; jj++) {

strcpy(epotemp->alignedPKSname[jj],epothilone.alignedPKSname[jj]);

epotemp->marked[jj] = epothilone.marked[jj];

strcpy(epotemp->context[jj],epothilone.context[jj]);

epotemp->boundarytoright[jj] = epothilone.boundarytoright[jj];

} 

}/*reset_epotemp*/ 

Thus, the present invention provides a useful means to generate new PKS genes and 
corresponding enzymes to produce polyketides. The invention having now been described by 
way of written description and examples, those of skill in the art will recognize that the 
invention can be practiced in a variety of embodiments and that the foregoing description and 
examples are for purposes of illustration and not limitation of the following claims. 



WO 01/92991 






PCT/US01/17352 



WHAT IS CLAIMED IS: 

1. A method for representing the structure of a polyketide produced by a modular
polyketide synthase, said method comprising the steps of: 

(a) defining a set of monomer units of which said polyketide is 
composed, 

(b) assigning an alphanumeric symbol or symbols to each different 
monomer unit in said set, 

(c) identifying one or more monomers in said set that are present in said
polyketide, and

(d) composing a string of said symbols ordered in a manner reflecting
the order in which said monomers occur in said polyketide, wherein said string
of symbols represents the structure of said polyketide. 

2. The method of claim 1, wherein said monomer set comprises two-carbon unit monomers, 
wherein a first carbon of said unit is substituted with hydrogen or methyl, and a second carbon 
of said unit is substituted with oxygen, hydroxy, or hydrogen, and said two carbon unit 
comprises either a single or a double bond between said first and second carbons. 

3. The method of claim 2, wherein said monomer set additionally comprises one or more 
members selected from the group consisting of two carbon unit monomers in which said first 
carbon is substituted with hydroxy, methoxy, or ethyl; a moiety corresponding to an amino acid 
or amino acid derivative incorporated into a PKS by a non-ribosomal peptide synthase; a moiety 
corresponding to a structure incorporated into a polyketide by an AMP ligase or a CoA ligase; 
and a moiety corresponding to a structure corresponding to a structure in a polyketide after 
modification by a polyketide modification enzyme. 

4. The method of claim 2 wherein the set of monomer unit and corresponding symbol
comprises:

[Chemical structure drawings defining the monomer units assigned the symbols A, B, C, D, E, F, I, J, K, L, M, and N]



5. The method of claim 4 wherein the set of monomer unit further comprises a 
miscellaneous monomer that is assigned the symbol Q. 

6. The method of claim 4 wherein the set of monomer unit and corresponding symbol 
further comprises 

[Chemical structure drawings of R-substituted monomer units assigned primed symbols, including A', B', C', H', J', K', and M']

7. A database of polyketides, in which each said member is represented by a string of
alpha-numeric symbols, wherein said symbols represent structural subunits of said polyketide,
and said string represents the order in which such subunits occur in said polyketide.

8. The database of claim 7 that includes at least 100 different polyketides.

9. The database of claim 7 wherein each said member is represented by a CHUCKLES 
string. 

10. The database of claim 7 wherein each said member is represented by an annotated 
CHUCKLES string. 

11. The database of claim 7 wherein the symbol and its corresponding structural subunit are
selected from the group consisting of 

[Chemical structure drawings pairing each symbol, including G and H, with its structural subunit]

and Q for a miscellaneous monomer. 

12. The database of claim 7 wherein the symbol and its corresponding structural subunit are
selected from the group consisting of

[Chemical structure drawings pairing each symbol, both unprimed symbols such as A through E and primed, R-substituted symbols such as C', D', H', J', and M', with its structural subunit]

and Q for a miscellaneous monomer. 



13. A database of polyketides, in which each said member is represented by a linearized 
representation of said polyketide. 

14. A method of designing a PKS gene capable of producing a desired polyketide, which 
method comprises: 

(a) defining a string of alphanumeric symbols representing the structure of said 
polyketide, 

(b) comparing said string to a database of strings of alphanumeric symbols 
representing polyketides produced by PKS genes, 

(c) identifying common elements in said string representing the structure of said
polyketide with elements in said strings in said database, and 

(d) generating one or more new strings from elements identified in step (b) that 
match said string representing the structure of said polyketide, wherein said new string defines a 
PKS gene capable of producing said polyketide. 

15. The method of claim 14, wherein all possible PKS genes encoding a desired polyketide 
from said database are generated and displayed. 

16. The method of claim 14, wherein said new strings generated in step (d) are rated and
displayed in an order based on one or more parameters. 

17. The method of claim 16, wherein said parameters are selected from the group consisting 
of number of non-native module interfaces and number of non-native protein interfaces. 
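
The comparison and rating steps recited in claims 14 through 17 can be illustrated with a deliberately simplified sketch. It is not the algorithm of the program listing: it assumes a greedy strategy in which the target string is covered from left to right by the longest piece found as a substring of any database string, and it uses the number of junctions between pieces as a stand-in for the number of non-native interfaces. The names longest_library_match, count_junctions, and MAX_TARGET are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_TARGET 64   /* assumed upper bound on a piece length */

/* Length of the longest prefix of target (capped at MAX_TARGET) that
   occurs as a substring of some library entry; 0 when even the first
   symbol is absent from every entry. */
static size_t longest_library_match(const char *target,
                                    const char *const *lib, size_t nlib)
{
    char piece[MAX_TARGET + 1];
    size_t len = strlen(target);
    size_t n, k;

    if (len > MAX_TARGET) len = MAX_TARGET;
    for (n = len; n > 0; n--) {
        memcpy(piece, target, n);
        piece[n] = '\0';
        for (k = 0; k < nlib; k++) {
            if (strstr(lib[k], piece) != NULL) return n;
        }
    }
    return 0;
}

/* Covers target greedily with library pieces and returns the number
   of junctions between consecutive pieces, or -1 when some symbol of
   target occurs in no library entry. */
int count_junctions(const char *target,
                    const char *const *lib, size_t nlib)
{
    int segments = 0;

    assert(target != NULL && lib != NULL);
    while (*target != '\0') {
        size_t n = longest_library_match(target, lib, nlib);
        if (n == 0) return -1;
        target += n;
        segments++;
    }
    return segments > 0 ? segments - 1 : 0;
}
```

Under these assumptions, a target that already occurs whole in the database scores zero junctions, and candidate designs could be displayed in increasing order of this score.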

Figure 1

CHUCKLES: ADGJDD
SMILES: 

Cl(=0)-[C@Hl(C)[C®@H](OHMC@®H](C)[C@@Hl(OH>- 

[C@@H](C)C-[C@@H1(QC(=0)-[C@H1(Q[C@@H1(C)- 

[C@@H](Q[C@@H](CQ01 



Figure 3 
Figure 4A

Figure 4C